1 Executive Summary


The aim of the PACE project was to reduce the energy consumption of microprocessors by exploiting
compile-time knowledge to reduce run-time switching activity and to power down unneeded blocks. The
project had two phases. The first phase focused on understanding and reducing power consumption within
microprocessor components, such as caches, register files, and arithmetic units. Several new techniques
were developed to reduce both switching and leakage power. The second phase developed a new
energy-exposed microprocessor architecture, SCALE (Software-Controlled Architecture for Low Energy).
SCALE is based on a new vector-thread architectural paradigm which unifies the vector and threaded
execution models, to provide efficient execution of many forms of parallelism.

2 Approach

The project developed a highly parallel microprocessor architecture, SCALE, that is structured as an
array of processing tiles. Each tile contains both processing and memory resources and the tiles
communicate with each other and off-chip devices over an on-chip communications network. This tiled
structure provides both high performance and low energy consumption by allowing distributed parallel
computations on local data. Software can trade energy and performance by varying the number of tiles
allocated to a task. In addition, each tile has an unprecedented level of fine-grain software power
control to enable deactivation of unneeded microarchitectural components.

Modern instruction set architectures (ISAs), such as RISC and VLIW machines, provide a hardware-software
interface designed solely for maximum performance with minimum hardware complexity. Compared with
application-specific custom circuitry, these general purpose processors exhibit a factor of 100-1000
worse energy-delay product. This project worked on reducing this gap by re-examining the
hardware-software interface, only now considering both performance and energy consumption. The approach
was to co-develop new machine architectures that expose energy consumption to software together with
new compilation technology that can communicate energy-saving compile-time knowledge to the hardware.
The result was the SCALE architecture, which introduces a new vector-thread architectural paradigm that
provides high performance at low power for many forms of application parallelism.

The initial phase of the project examined the power consumption in various microarchitectural
components. We developed a number of power saving techniques at the microarchitectural level, and
gained insight into where software could best help reduce power through the instruction set level.

To help evaluate the approach, a fast and accurate energy-performance simulation framework (SyCHOSys)
was developed that enables simulation of complete microprocessor designs running large scale
applications while gathering detailed energy statistics. This simulator extends the state of the art by
enabling accurate (< 10% error) cycle-by-cycle energy characterization for billions of cycles of
simulated CPU activity.

The compiler research in this project leveraged two existing sophisticated optimizing compiler
infrastructures developed at MIT: the RAW FORTRAN and C compiler and the FLEX Java compiler. These were
enhanced and extended to extract compile-time knowledge to reduce microprocessor power.

3 Accomplishments

SyCHOSys Power-Performance Simulator

We developed a compiled energy-performance simulator [1]. This simulator tracks the energy consumption
for each individual signal within a processor with less than 10% error of a full SPICE-level circuit
simulation, but is fast enough to simulate several billion cycles of application code in a single day
on a commercial workstation.

We used the simulation to determine the energy consumption within a complete low-power microprocessor
architecture [2] running a range of application benchmarks. Results obtained illustrate areas that
require further energy savings after common low-power optimizations are applied. This simulator
framework was used for many of the following studies.

Activity-Sensitive Flip-Flops

Latches and flip-flops are important components of total power dissipation. We developed a new
activity-sensitive flip-flop design methodology which reduces flip-flop and latch energy by up to 60%
with no speed penalty by using detailed knowledge of the expected data and clock activity for each
register [3].

We also investigated the effect of loading on flip-flop power consumption, and showed that the relative
energy-delay performance of various flip-flop designs changes as both absolute output load and
input-to-output load ratio are varied [4].

Cache and Register File Optimizations

In the first phase of the project, we developed a number of techniques to reduce energy in the caches
and register files of processors.

Way-memoization avoids cache tag lookups by building direct links within the instruction cache. This
removes 97% of instruction cache tag lookups, saving 23% of I-cache energy [5].

We developed a new dynamic cache resizing technique that adapts active cache size to application needs
to reduce switching and leakage power in highly-associative caches. This technique typically reduces
active cache size and power by one half with minimal impact on performance [6].

To reduce register file energy, we developed a banked register file scheme with a simple speculative
control scheme [7]. This reduced register file size by a factor of three and access energy by 40%.

Fine-Grain Leakage Reduction

Leakage current is a growing concern as threshold voltages are scaled down. We have developed circuits
and microarchitectures for fine-grain dynamic leakage reduction, which allow small portions of an
active processor to be powered down for a short period of time to save static leakage power. Our
techniques use leakage-biased circuits, where leakage currents themselves are used to bias circuits
into a low-leakage state. Savings of over 57% of overall active power were estimated for a multiported
register file, with no performance loss [8].

We have also developed a high-performance leakage-biased domino circuit style, which reduces standby
leakage by a factor of 100 compared to dual-Vt domino [9], at the same delay.

Activity Migration

Power dissipation is distributed unevenly over the surface of a microprocessor, leading to local
temperature “hotspots”, which limit sustainable power dissipation and reduce reliability.

We developed the technique of activity migration to reduce power density in microprocessors. Activity
migration reduces die temperature by moving computation between multiple redundant circuits as each one
heats up. The drop in die temperature reduces leakage current by up to 35% and increases transistor
speed by up to 16% [10].

Heads-and-Tails Variable Length Instruction Encoding

We developed the heads-and-tails format, which simplifies pipelined or superscalar instruction fetch
and decode of a dense variable-length instruction format. For RISC processors a 25% reduction in code
size was achieved; for VLIW processors a 40% reduction in static code size was achieved [11, 12].
Reduced code size provides better hit ratios in small low-power caches.

Energy-Exposed Instruction Sets

The second phase of the project focused on how compile-time knowledge could reduce energy consumption
at run time. We developed several complementary ideas in energy-exposed instruction sets [13].

Inside current microprocessors, there is considerable microarchitectural overhead in supporting precise
exceptions on every instruction. Using software restart markers we can shift some of this burden to the
compiler, by only marking certain instructions as requiring precise exception semantics. We implemented
compiler passes for both C and Java and determined we could remove around 60% of exception points in
code using only a simple local analysis [14, 13].

The compiler is responsible for register allocation, and this information can be used to reduce
register file traffic. We developed a hybrid accumulator-RISC architecture that allows software to
manage the bypass latches directly, and implemented compiler passes that removed up to 36% of register
file reads and up to 47% of register file writes in C and Java programs [14, 13].

We also developed the direct-addressed cache, a combined hardware and software scheme that uses
compile-time knowledge to remove up to 70% of data cache tag checks at run-time [15].

SCALE Vector-Thread Architecture

The SCALE architecture builds upon the experience gained in the first phase of the project. SCALE is
based around an energy-exposed instruction set and introduces a new architectural paradigm, vector
threading. The vector-thread architecture unifies vector and threaded parallel execution models to give
high performance on a wide range of applications [16].

An instruction-level simulator and a detailed microarchitectural-level cycle simulator have been
completed for SCALE.

We are continuing to complete a prototype implementation of the SCALE architecture in other work.

Mondriaan Memory Protection

A new fine-grained memory protection system, Mondriaan Memory Protection, was developed as an offshoot
of the software-controlled low-power cache design [17, 18, 19]. This scheme provides efficient hardware
memory protection to improve system robustness.

A patent has been filed for this technique.


4 Technology transition

Numerous technology transition paths are being pursued to transfer results to industrial partners.

Activity-sensitive Flip-Flops and Latches

The activity-sensitive flip-flop and latch methodology has been transferred to the Desktop Products
Group at Intel Corporation, where it was evaluated and cleared for use in product development.

Heads and Tails Instruction Compression

A collaboration with Paolo Faraboschi and Josh Fisher at HP Laboratories was undertaken to evaluate
Heads-and-Tails instruction encoding for HP’s Lx embedded VLIW microprocessor, using HP compilers and
simulators.

Fine-Grain Dynamic Leakage Reduction

Techniques for fine-grain dynamic leakage reduction are being evaluated within the Desktop Products
Group at Intel Corporation. An MIT graduate student worked as an intern with George Cai at Intel,
Austin to help with technology transition. Intel is continuing to fund this work at MIT.

Banked Register Files

A graduate student is currently working with Xiaowie Chen at IBM T. J. Watson evaluating the use of
banked register files within future IBM PowerPC processors.

Power Modeling

A detailed cache and memory energy model, ZOOM, was developed in collaboration with Jude Rivers at
IBM’s T. J. Watson Laboratory. A student worked at IBM for the summer to incorporate data from
commercial cache designs.

A second graduate student is currently working on power models for single-chip multiprocessors with
Pradip Bose at IBM T. J. Watson.

5 Conclusion

A variety of power saving techniques at both the microarchitectural and instruction set level have been
developed, several of which are being actively transferred to industry through student internships.
Over a dozen conference papers and student theses have been published to distribute results to the
research community. The SCALE vector-thread architecture was developed and the detailed design is now
being pursued in other work.

Abstract

The vector-thread (VT) architectural paradigm unifies the vector and multithreaded compute models. The
VT abstraction provides the programmer with a control processor and a vector of virtual processors
(VPs). The control processor can use vector-fetch commands to broadcast instructions to all the VPs or
each VP can use thread-fetches to direct its own control flow. A seamless intermixing of the vector and
threaded control mechanisms allows a VT architecture to flexibly and compactly encode application
parallelism and locality, and a VT machine exploits these to improve performance and efficiency. We
present SCALE, an instantiation of the VT architecture designed for low-power and high-performance
embedded systems. We evaluate the SCALE prototype design using detailed simulation of a broad range of
embedded applications and show that its performance is competitive with larger and more complex
processors.

1. Introduction

Parallelism and locality are the key application characteristics exploited by computer architects to
make productive use of increasing transistor counts while coping with wire delay and power dissipation.
Conventional sequential ISAs provide minimal support for encoding parallelism or locality, so
high-performance implementations are forced to devote considerable area and power to on-chip structures
that extract parallelism or that support arbitrary global communication. The large area and power
overheads are justified by the demand for even small improvements in performance on legacy codes for
popular ISAs. Many important applications have abundant parallelism, however, with dependencies and
communication patterns that can be statically determined. ISAs that expose more parallelism reduce the
need for area and power intensive structures to extract dependencies dynamically. Similarly, ISAs that
allow locality to be expressed reduce the need for long-range communication and complex interconnect.
The challenge is to develop an efficient encoding of an application’s parallel dependency graph and to
reduce the area and power consumption of the microarchitecture that will execute this dependency graph.

In this paper, we unify the vector and multithreaded execution models with the vector-thread (VT)
architectural paradigm. VT allows large amounts of structured parallelism to be compactly encoded in a
form that allows a simple microarchitecture to attain high performance at low power by avoiding complex
control and datapath structures and by reducing activity on long wires. The VT programmer’s model
extends a conventional scalar control processor with an array of slave virtual processors (VPs). VPs
execute strings of RISC-like instructions packaged into atomic instruction blocks (AIBs). To execute
data-parallel code, the control processor broadcasts AIBs to all the slave VPs. To execute
thread-parallel code, each VP directs its own control flow by fetching its own AIBs. Implementations of
the VT architecture can also exploit instruction-level parallelism within AIBs.

In this way, the VT architecture supports a modeless intermingling of all forms of application
parallelism. This flexibility provides new ways to parallelize codes that are difficult to vectorize or
that incur excessive synchronization costs when threaded. Instruction locality is improved by allowing
common code to be factored out and executed only once on the control processor, and by executing the
same AIB multiple times on each VP in turn. Data locality is improved as most operand communication is
isolated to within an individual VP.

We are developing a prototype processor, SCALE, which is an instantiation of the vector-thread
architecture designed for low-power and high-performance embedded systems. As transistors have become
cheaper and faster, embedded applications have evolved from simple control functions to cellphones that
run multitasking networked operating systems with realtime video, three-dimensional graphics, and
dynamic compilation of garbage-collected languages. Many other embedded applications require
sophisticated high-performance information processing, including streaming media devices, network
routers, and wireless base stations. In this paper, we show how benchmarks taken from these embedded
domains can be mapped efficiently to the SCALE vector-thread architecture. In many cases, the codes
exploit multiple types of parallelism simultaneously for greater efficiency.

The paper is structured as follows. Section 2 introduces the vector-thread architectural paradigm.
Section 3 then describes the SCALE processor which contains many features that extend the basic VT
architecture. Section 4 presents an evaluation of the SCALE processor using a range of embedded
benchmarks and describes how SCALE efficiently executes various types of code. Finally, Section 5
reviews related work and Section 6 concludes.

2. The VT Architectural Paradigm

An architectural paradigm consists of the programmer’s model for a class of machines plus the expected
structure of implementations of these machines. This section first describes the abstraction a VT
architecture provides to a programmer, then gives an overview of the physical model for a VT machine.

2.1 VT Abstract Model

The vector-thread architecture is a hybrid of the vector and multithreaded models. A conventional
control processor interacts with a virtual processor vector (VPV), as shown in Figure 1. The
programming model consists of two interacting instruction sets, one for the control processor and one
for the VPs. Applications can be mapped to the VT architecture in a variety of ways but it is
especially well suited to executing loops; each VP executes a single iteration of the loop and the
control processor is responsible for managing the execution.

Figure 1: Abstract model of a vector-thread architecture. A control processor interacts with a virtual
processor vector (an ordered sequence of VPs).

Figure 2: Vector-fetch commands. For simple data-parallel loops, the control processor can use a
vector-fetch command to send an atomic instruction block (AIB) to all the VPs in parallel. In this
vector-vector add example, we assume that r0 has been loaded with each VP’s index number; and r1, r2,
and r3 contain the base addresses of the input and output vectors. The instruction notation places the
destination registers after the

A virtual processor contains a set of registers and has the ability to execute RISC-like instructions
with virtual register specifiers. VP instructions are grouped into atomic instruction blocks (AIBs),
the unit of work issued to a VP at one time. There is no automatic program counter or implicit
instruction fetch mechanism for VPs; all instruction blocks must be explicitly requested by either the
control processor or the VP itself.

The control processor can direct the VPs’ execution using a vector-fetch command to issue an AIB to all
the VPs in parallel, or a VP-fetch to target an individual VP. Vector-fetch commands provide a
programming model similar to conventional vector machines except that a large block of instructions can
be issued at once. As a simple example, Figure 2 shows the mapping for a data-parallel vector-vector
add loop. The AIB for one iteration of the loop contains two loads, an add, and a store. A vector-fetch
command sends this AIB to all the VPs in parallel and thus initiates vl loop iterations, where vl is
the length of the VPV (i.e., the vector length). Every VP executes the same instructions but operates
on distinct data elements as determined by its index number. As a more efficient alternative to the
individual VP loads and stores shown in the example, a VT architecture can also provide vector-memory
commands issued by the control processor which move a vector of elements between memory and one
register in each VP.
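The vector-vector add mapping can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the `VirtualProcessor` class, the representation of an AIB as a Python function, and the flat `memory` list are hypothetical stand-ins, not SCALE's instruction encoding, and the VPs are stepped sequentially even though a real VT machine would run them in parallel.

```python
# Toy model of a vector-fetch (illustrative assumptions, not the SCALE ISA).

class VirtualProcessor:
    def __init__(self, index, num_regs=4):
        self.regs = [0] * num_regs
        self.regs[0] = index  # r0 holds the VP's index number


def vector_add_aib(vp, memory):
    """One loop iteration: two loads, an add, and a store."""
    i = vp.regs[0]
    a = memory[vp.regs[1] + i]       # load element i of the first input
    b = memory[vp.regs[2] + i]       # load element i of the second input
    memory[vp.regs[3] + i] = a + b   # add, then store to the output vector


def vector_fetch(vps, aib, memory):
    """Issue the same AIB to every VP; each VP runs one loop iteration."""
    for vp in vps:  # sequential here purely for simplicity
        aib(vp, memory)


def run_vector_add(x, y):
    vl = len(x)  # vector length: one VP per loop iteration
    memory = list(x) + list(y) + [0] * vl
    vps = [VirtualProcessor(i) for i in range(vl)]
    for vp in vps:
        # r1, r2, r3: base addresses of the two inputs and the output
        vp.regs[1], vp.regs[2], vp.regs[3] = 0, vl, 2 * vl
    vector_fetch(vps, vector_add_aib, memory)
    return memory[2 * vl:]
```

Calling `run_vector_add([1, 2, 3], [10, 20, 30])` issues a single vector-fetch to three VPs, each of which handles the element selected by its own index.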

Figure 3: Cross-VP data transfers. For loops with cross-iteration dependencies, the control processor
can vector-fetch an AIB that contains cross-VP data transfers. In this saturating parallel prefix sum
example, we assume that r0 has been loaded with each VP’s index number, r1 and r2 contain the base
addresses of the input and output vectors, and r3 and r4 contain the min and max values of the
saturation range. The instruction notation uses “(p)” to indicate predication.

Figure 4: VP threads. Thread-fetches allow a VP to request its own AIBs and thereby direct its own
control flow. In this pointer-chase example, we assume that r0 contains a pointer to a linked list, r1
contains the address of the AIB, and r2 contains a count of the number of links traversed.

The VT abstract model connects VPs in a unidirectional ring topology and allows a sending instruction
on VP(n) to transfer data directly to a receiving instruction on VP(n+1). These cross-VP data transfers
are dynamically scheduled and resolve when the data becomes available. Cross-VP data transfers allow
loops with cross-iteration dependencies to be efficiently mapped to the vector-thread architecture, as
shown by the example in Figure 3. A single vector-fetch command can introduce a chain of prevVP
receives and nextVP sends that spans the VPV. The control processor can push an initial value into the
cross-VP start/stop queue (shown in Figure 1) before executing the vector-fetch command. After the
chain executes, the final cross-VP data value from the last VP wraps around and is written into this
same queue. It can then be popped by the control processor or consumed by a subsequent prevVP receive
on VP0 during stripmined loop execution.
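The cross-VP chain and its interaction with stripmining can be sketched in software. The following is a rough behavioral model, assuming a hypothetical `saturating_prefix_sum` driver, a Python `deque` standing in for the cross-VP start/stop queue, and saturation applied to the running sum at every step; none of these details are taken from the SCALE implementation.

```python
from collections import deque


def saturate(value, lo, hi):
    """Clamp value into the range [lo, hi]."""
    return max(lo, min(hi, value))


def prefix_sum_aib(vp_index, prev_value, x, out, base, lo, hi):
    """AIB body for one VP: receive from prevVP, accumulate, send to nextVP."""
    total = saturate(prev_value + x[base + vp_index], lo, hi)
    out[base + vp_index] = total
    return total  # the value this VP sends to nextVP


def saturating_prefix_sum(x, lo, hi, vl=4):
    out = [0] * len(x)
    queue = deque([0])  # control processor pushes the initial value
    for base in range(0, len(x), vl):   # stripmine: vl elements per vector-fetch
        n = min(vl, len(x) - base)
        carry = queue.popleft()         # VP0's prevVP receive reads the queue
        for vp in range(n):             # chain of receives and sends across the VPs
            carry = prefix_sum_aib(vp, carry, x, out, base, lo, hi)
        queue.append(carry)             # the last VP's send wraps around into the queue
    final = queue.popleft()             # control processor pops the final value
    return out, final
```

Because the value left in the queue by one strip is consumed by VP0 of the next, the result does not depend on the vector length chosen for `vl`.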

The VT architecture also allows VPs to direct their own control flow. A VP executes a thread-fetch to
request an AIB to execute after it completes its active AIB, as shown in Figure 4. Fetch instructions
may be predicated to provide conditional branching. A VP thread persists as long as each AIB contains
an executed fetch instruction, but halts once the VP stops issuing thread-fetches. Once a VP thread is
launched, it executes to completion before the next command from the control processor takes effect.
The control processor and VPs all operate concurrently in the same address space. Memory dependencies
between these processors are preserved via explicit memory fence and synchronization operations or
atomic read-modify-write operations.
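A VP thread's lifetime, persisting only while each AIB issues a fetch, can be modeled with a short loop. This sketch loosely follows the pointer-chase setup (r0 as list pointer, r1 as the AIB address, r2 as the link count); the `Node` and `VPState` classes, the use of a Python function as the AIB "address", and the requirement that the list be non-empty are all assumptions made for illustration.

```python
class Node:
    def __init__(self, next_node=None):
        self.next = next_node


class VPState:
    def __init__(self, head, aib):
        self.r0 = head  # pointer to the current list node (assumed non-null)
        self.r1 = aib   # "address" of the AIB to re-fetch
        self.r2 = 0     # count of links traversed


def pointer_chase_aib(vp):
    """Follow one link, then conditionally thread-fetch the same AIB."""
    vp.r0 = vp.r0.next   # load the next pointer
    vp.r2 += 1           # count the link just followed
    # Predicated thread-fetch: only issued while the pointer is non-null.
    return vp.r1 if vp.r0 is not None else None


def run_vp_thread(vp):
    """Run AIBs until one completes without issuing a thread-fetch."""
    next_aib = vp.r1
    while next_aib is not None:
        next_aib = next_aib(vp)
    return vp.r2


def make_list(length):
    """Build a singly linked list with the given number of nodes."""
    head = None
    for _ in range(length):
        head = Node(head)
    return head
```

The thread halts as soon as an AIB finishes with its fetch predicated off, which is the only termination condition in this model.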

The ability to freely intermix vector-fetches and thread-fetches allows a VT architecture to combine
the best attributes of the vector and multithreaded execution paradigms. As shown in Figure 5, the
control processor can issue a vector-fetch command to launch a vector of VP threads, each of which
continues to execute as long as it issues thread-fetches. These thread-fetches break the rigid control
flow of traditional vector machines, enabling the VP threads to follow independent control paths.
Thread-fetches broaden the range of loops which can be mapped efficiently to VT, allowing the VPs to
execute data-parallel loop iterations with conditionals or even inner-loops. Apart from loops, the VPs
can also be used as free-running threads, where they operate independently from the control processor
and retrieve tasks from a shared work queue.

The VT architecture allows software to efficiently expose structured parallelism and locality at a fine
granularity. Compared to a conventional threaded architecture, the VT model allows common bookkeeping
code to be factored out and executed once on the control processor rather than redundantly in each
thread. AIBs enable a VT machine to efficiently amortize instruction fetch overhead, and provide a
framework for cleanly handling temporary state. Vector-fetch commands explicitly encode parallelism and
instruction locality, allowing a VT machine to attain high performance while amortizing control
overhead. Vector-memory commands avoid separate load and store requests for each element, and can be
used to exploit memory data-parallelism even in loops with non-data-parallel compute. For loops with
cross-iteration dependencies, cross-VP data transfers explicitly encode fine-grain communication and
synchronization, avoiding heavyweight inter-thread memory coherence and synchronization primitives.

Figure 5: The control processor can use a vector-fetch command to send an AIB to all the VPs, after
which each VP can use thread-fetches to fetch its own AIBs.

Figure 6: Physical model of a VT machine. The implementation shown has four parallel lanes in the
vector-thread unit (VTU), and VPs are striped across the lane array with the low-order bits of a VP
index indicating the lane to which it is mapped. The configuration shown uses VPs with five virtual
registers, and with twenty physical registers each lane is able to support four VPs. Each lane is
divided into a command management unit (CMU) and an execution cluster, and the execution cluster has an
associated cross-VP start-stop queue.

2.2 VT Physical Model

An architectural paradigm’s physical model is the expected structure for efficient implementations of
the abstract model. The VT physical model contains a conventional scalar unit for the control processor
together with a vector-thread unit (VTU) that executes the VP code. To exploit the parallelism exposed
by the VT abstract model, the VTU contains a parallel array of processing lanes as shown in Figure 6.
Lanes are the physical processors which VPs map onto, and the VPV is striped across the lane array.
Each lane contains physical registers, which hold the state of VPs mapped to the lane, and functional
units, which are time-multiplexed across the VPs. In contrast to traditional vector machines, the lanes
in a VT machine execute decoupled from each other. Figure 7 shows an abstract view of how VP execution
is time-multiplexed on the lanes for both vector-fetched and thread-fetched AIBs. This fine-grain
interleaving helps VT machines hide functional unit, memory, and thread-fetch latencies.

As shown in Figure 6, each lane contains both a command management unit (CMU) and an execution cluster.
An execution cluster consists of a register file, functional units, and a small AIB cache. The lane’s
CMU buffers commands from the control processor in a queue (cmd-Q) and holds pending thread-fetch
addresses for the lane’s VPs. The CMU also holds the tags for the lane’s AIB cache. The AIB cache can
hold one or more AIBs and must be at least large enough to hold an AIB of the maximum size defined in
the VT architecture.

Figure 7: Lane Time-Multiplexing. Both vector-fetched and thread-fetched AIBs are time-multiplexed on
the physical lanes.

The CMU chooses a vector-fetch, VP-fetch, or thread-fetch command to process. The fetch command
contains an address which is looked up in the AIB tags. If there is a miss, a request is sent to the
fill unit which retrieves the requested AIB from the primary cache. The fill unit handles one lane’s
AIB miss at a time, except when lanes are executing vector-fetch commands, in which case refill
overhead is amortized by broadcasting the AIB to all lanes simultaneously.

After  a  fetch  command  hits  in  the  AIB  cache  or  after  a  miss  refill 
has  been  processed,  the  CMU  generates  an  execute  directive  which 
contains  an  index  into  the  AIB  cache.  For  a  vector-fetch  command 
the  execute  directive  indicates  that  the  AIB  should  be  executed  by 
all  VPs  mapped  to  the  lane,  while  for  a  VP-fetch  or  thread-fetch 
command  it  identifies  a  single  VP  to  execute  the  AIB.  The  execute 
directive  is  sent  to  a  queue  in  the  execution  cluster,  leaving  the 
CMU  free  to  begin  processing  the  next  command.  The  CMU  is 
able  to  overlap  the  AIB  cache  refill  for  new  fetch  commands  with 
the  execution  of  previous  ones,  but  must  track  which  AIBs  have 
outstanding  execute  directives  to  avoid  overwriting  their  entries  in 
the  AIB  cache.  The  CMU  must  also  ensure  that  the  VP  threads 
execute  to  completion  before  initiating  a  subsequent  vector-fetch. 
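
The tag check and the rule against overwriting AIBs with outstanding
execute directives can be sketched in C as below. This is only an
illustrative model, not the SCALE hardware: the entry count
(AIB_ENTRIES), the structure layout, the victim-selection order, and
all function names are assumptions introduced for the example, and
the refill from the primary cache is elided.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define AIB_ENTRIES 4  /* hypothetical number of AIB cache entries */

typedef struct {
    bool     valid;
    uint32_t tag;          /* AIB address held in this entry            */
    int      outstanding;  /* execute directives issued but not retired */
} AibTag;

typedef struct { AibTag e[AIB_ENTRIES]; } AibTags;

/* Return the index of the entry holding addr, or -1 on a miss. */
int aib_lookup(const AibTags *t, uint32_t addr)
{
    for (int i = 0; i < AIB_ENTRIES; i++)
        if (t->e[i].valid && t->e[i].tag == addr)
            return i;
    return -1;
}

/* Pick an entry to refill: prefer an invalid entry, otherwise one with
 * no outstanding execute directives. Returns -1 if every entry is busy. */
int aib_pick_victim(const AibTags *t)
{
    for (int i = 0; i < AIB_ENTRIES; i++)
        if (!t->e[i].valid)
            return i;
    for (int i = 0; i < AIB_ENTRIES; i++)
        if (t->e[i].outstanding == 0)
            return i;
    return -1;
}

/* Process one fetch command. Returns the AIB cache index carried by the
 * execute directive, or -1 if the CMU must stall before refilling. */
int cmu_process_fetch(AibTags *t, uint32_t addr)
{
    int idx = aib_lookup(t, addr);
    if (idx < 0) {
        idx = aib_pick_victim(t);
        if (idx < 0)
            return -1;               /* all entries still in use */
        t->e[idx].valid = true;      /* refill from primary cache elided */
        t->e[idx].tag = addr;
        t->e[idx].outstanding = 0;
    }
    t->e[idx].outstanding++;         /* execute directive now in flight */
    return idx;
}

/* Called when the cluster finishes an execute directive. */
void cmu_retire(AibTags *t, int idx)
{
    assert(idx >= 0 && idx < AIB_ENTRIES && t->e[idx].outstanding > 0);
    t->e[idx].outstanding--;
}
```

In this sketch a fetch that hits simply issues another execute
directive for the same entry, while a miss may only reuse an entry
whose directives have all retired.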

To process an execute directive, the cluster reads VP instructions
one by one from the AIB cache and executes them for the appropriate
VP. When processing an execute directive from a vector-fetch
command,  all  of  the  instructions  in  the  AIB  are  executed  for  one  VP 
before  moving  on  to  the  next.  The  virtual  register  indices  in  the  VP 
instructions  are  combined  with  the  active  VP  number  to  create  an 
index  into  the  physical  register  file.  To  execute  a  fetch  instruction, 
the  cluster  sends  the  requested  AIB  address  to  the  CMU  where  the 
VP’s  associated  pending  thread-fetch  register  is  updated. 

The lanes in the VTU are inter-connected with a unidirectional
ring network to implement the cross-VP data transfers. When a
cluster encounters an instruction with a prevVP receive, it stalls
until the data is available from its predecessor lane. When the VT
architecture allows multiple cross-VP instructions in a single AIB,
with some sends preceding some receives, the hardware implementation
must provide sufficient buffering of send data to allow all the
receives in an AIB to execute. By induction, deadlock is avoided if
each lane ensures that its predecessor can never be blocked trying
to send it cross-VP data.

3.  The  SCALE  VT  Architecture 

SCALE  is  an  instance  of  the  VT  architectural  paradigm  designed 
for  embedded  systems.  The  SCALE  architecture  has  a  MIPS-based 
control  processor  extended  with  a  VTU.  The  SCALE  VTU  aims  to 
provide high performance at low power for a wide range of applications
while using only a small area. This section describes the
SCALE VT architecture, presents a simple code example implemented
on SCALE, and gives an overview of the SCALE microarchitecture
and SCALE processor prototype.

3.1  SCALE  Extensions  to  VT 

Clusters 

To improve performance while reducing area, energy and circuit
delay, SCALE extends the single-cluster VT model (shown in Figure 1)
by partitioning VPs into multiple execution clusters with
independent register sets. VP instructions target an individual cluster
and perform RISC-like operations. Source operands must be local
to the cluster, but results can be written to any cluster in the
VP, and an instruction can write its result to multiple destinations.
Each cluster within a VP has a separate predicate register, and
instructions can be positively or negatively predicated.

SCALE  clusters  are  heterogeneous,  but  all  clusters  support  basic 
integer  operations.  Cluster  0  additionally  supports  memory  access 
instructions, cluster 1 supports fetch instructions, and cluster 3
supports integer multiply and divide. Though not used in this paper, the
SCALE  architecture  allows  clusters  to  be  enhanced  with  layers  of 
additional  functionality  (e.g.,  floating-point  operations,  fixed-point 
operations,  and  sub-word  SIMD  operations),  or  new  clusters  to  be 
added  to  perform  specialized  operations. 

Registers  and  VP  Configuration 

The general registers in each cluster of a VP are categorized as
either private registers (pr’s) or shared registers (sr’s). Both
private and shared registers can be read and written by VP instructions
and by commands from the control processor. The main difference
is that private registers preserve their values between AIBs, while
shared registers may be overwritten by a different VP. Shared
registers can be used as temporary state within an AIB to increase the
number  of  VPs  that  can  be  supported  by  a  fixed  number  of  physical 
registers.  The  control  processor  can  also  vector-write  the  shared 
registers  to  broadcast  scalar  values  and  constants  used  by  all  VPs. 

In addition to the general registers, each cluster also has
programmer-visible chain registers (cr0 and cr1) associated with
the two ALU input operands. These can be used as sources and
destinations to avoid reading and writing the register files. Like
shared registers, chain registers may be overwritten between AIBs,
and they are also implicitly overwritten when a VP instruction uses
their associated operand position. Cluster 0 has a special chain
register called the store-data (sd) register through which all data for
VP stores must pass.

In  the  SCALE  architecture,  the  control  processor  configures  the 
VPs by indicating how many shared and private registers are required
in each cluster. The length of the virtual processor vector
changes with each re-configuration to reflect the maximum number
of VPs that can be supported. This operation is typically done
once outside each loop, and state in the VPs is undefined across
re-configurations. Within a lane, the VTU maps shared VP registers
to  shared  physical  registers.  Control  processor  vector-writes  to  a 
shared  register  are  broadcast  to  each  lane,  but  individual  VP  writes 
to  a  shared  register  are  not  coherent  across  lanes,  i.e.,  the  shared 
registers  are  not  global  registers. 
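
One way to picture the mapping from a VP's virtual registers to a
lane's physical register file is the C sketch below. The text does
not specify the physical layout, so the arrangement used here (shared
registers at the lowest physical indices, followed by one contiguous
block of private registers per VP) is purely an assumption, as are the
type and function names.

```c
#include <assert.h>

/* Hypothetical per-cluster layout (an assumption, not from the text):
 * physical registers [0, num_shared) hold the shared registers, and
 * each VP mapped to the lane owns num_private registers after them. */
typedef struct {
    int phys_regs;    /* physical registers in this cluster           */
    int num_shared;   /* shared registers configured for this cluster */
    int num_private;  /* private registers configured per VP          */
} ClusterConfig;

/* Shared virtual register sr maps to the same physical register for
 * every VP on the lane. */
int shared_phys_index(const ClusterConfig *cfg, int sr)
{
    assert(sr >= 0 && sr < cfg->num_shared);
    return sr;
}

/* Private virtual register pr of the lane-local VP number lane_vp
 * (0 for the first VP mapped to this lane) gets its own physical slot. */
int private_phys_index(const ClusterConfig *cfg, int lane_vp, int pr)
{
    assert(pr >= 0 && pr < cfg->num_private);
    assert(lane_vp >= 0);
    int idx = cfg->num_shared + lane_vp * cfg->num_private + pr;
    assert(idx < cfg->phys_regs);  /* VP must fit in the register file */
    return idx;
}
```

Under this layout, a write to a shared register by one VP is visible
to the other VPs on the same lane only, which matches the statement
that shared registers are not global registers.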

Vector-Memory  Commands 

In addition to VP load and store instructions, SCALE defines
vector-memory commands issued by the control processor for efficient
structured memory accesses. Like vector-fetch commands,
these operate across the virtual processor vector; a vector-load
writes the load data to a private register in each VP, while a
vector-store reads the store data from a private register in each VP.
SCALE also supports vector-load commands which target shared registers
to retrieve values used by all VPs. In addition to the typical
unit-stride and strided vector-memory access patterns, SCALE provides
vector segment accesses where each VP loads or stores several
contiguous memory elements to support “array-of-structures” data
layouts efficiently.

3.2  SCALE  Code  Example 

This section presents a simple code example to show how
SCALE is programmed. The C code in Figure 8 implements a simplified
version of the ADPCM speech decoder. Input is read from
a unit-stride byte stream and output is written to a unit-stride
halfword stream. The loop is non-vectorizable because it contains two
loop-carried dependencies: the index and valpred variables are
accumulated from one iteration to the next with saturation. The
loop also contains two table lookups.

The  SCALE  code  to  implement  the  example  decoder  function 
is shown in Figure 9. The code is divided into two sections with
MIPS control processor code in the .text section and SCALE VP
code in the .sisa (SCALE ISA) section. The SCALE VP code
implements one iteration of the loop with a single AIB; cluster 0
accesses memory, cluster 1 accumulates index, cluster 2 accumulates
valpred, and cluster 3 does the multiply.

    void decode_ex(int len, u_int8_t* in, int16_t* out) {
      int i;
      int index = 0;
      int valpred = 0;
      for (i = 0; i < len; i++) {
        u_int8_t delta = in[i];
        index += indexTable[delta];
        index = index < IX_MIN ? IX_MIN : index;
        index = IX_MAX < index ? IX_MAX : index;
        valpred += stepsizeTable[index] * delta;
        valpred = valpred < VALP_MIN ? VALP_MIN : valpred;
        valpred = VALP_MAX < valpred ? VALP_MAX : valpred;
        out[i] = valpred;
      }
    }

Figure 8: C code for decoder example.

The control processor first configures the VPs using the vcfgvl
command to indicate the register requirements for each cluster. In
this example, c0 uses one private register to hold the input data and
two shared registers to hold the table pointers; c1 and c2 each use
three shared registers to hold the min and max saturation values
and a temporary; c2 also uses a private register to hold the output
value; and c3 uses only chain registers so it does not need any
shared or private registers. The configuration indirectly sets vlmax,
the maximum vector length. In a SCALE implementation
with 32 physical registers per cluster and four lanes, vlmax would
be ⌊(32 − 3)/1⌋ × 4 = 116, limited by the register demands of
cluster 2. The vcfgvl command also sets vl, the active vector length,
to the minimum of vlmax and the length argument provided;
the resulting length is returned as a result. The control processor
next vector-writes several shared VP registers with constants
using the vwrsh command, then uses the xvppush command to
push the initial index and valpred values into the cross-VP
start/stop queues for clusters 1 and 2.
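
The vlmax arithmetic for this configuration can be reproduced with a
short C sketch. It assumes that each cluster's shared registers are
allocated once per lane, that private registers are allocated once
per VP, and that a cluster requesting no private registers places no
limit on the number of VPs. The names ClusterRegs, compute_vlmax, and
set_vl are illustrative only, and any implementation-specific cap on
the number of VPs is not modeled.

```c
#include <assert.h>
#include <limits.h>

typedef struct {
    int shared;  /* shared registers requested for the cluster  */
    int priv;    /* private registers requested per VP          */
} ClusterRegs;

/* vlmax = lanes * min over clusters of floor((phys - shared) / priv).
 * Clusters with priv == 0 do not constrain the result. */
int compute_vlmax(const ClusterRegs *c, int nclusters,
                  int phys_regs, int lanes)
{
    int per_lane = INT_MAX;
    for (int i = 0; i < nclusters; i++) {
        assert(c[i].shared >= 0 && c[i].priv >= 0);
        assert(c[i].shared <= phys_regs);
        if (c[i].priv == 0)
            continue;
        int vps = (phys_regs - c[i].shared) / c[i].priv;
        if (vps < per_lane)
            per_lane = vps;
    }
    /* This sketch requires at least one cluster with private registers. */
    assert(per_lane != INT_MAX);
    return per_lane * lanes;
}

/* Active vector length: the minimum of vlmax and the requested length. */
int set_vl(int vlmax, int requested)
{
    return requested < vlmax ? requested : vlmax;
}
```

For the decoder configuration (c0: two shared, one private; c1: three
shared; c2: three shared, one private; c3: none), cluster 0 would
allow 30 VPs per lane and cluster 2 allows 29, so cluster 2 sets the
limit of 29 × 4 = 116.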

The ISA for a VT architecture is defined so that code can
be written to work with any number of VPs, allowing the same
object code to run on implementations with varying or configurable
resources. To manage the execution of the loop, the control
processor uses stripmining to repeatedly launch a vector of
loop iterations. For each iteration of the stripmine loop, the control
processor uses the setvl command which sets the vector-length
to the minimum of vlmax and the length argument provided
(i.e., the number of iterations remaining for the loop); the
resulting vector-length is also returned as a result. In the decoder
example, the control processor then loads the input using
an auto-incrementing vector-load-byte-unsigned command (vlbuai),
vector-fetches the AIB to compute the decode, and stores
the output using an auto-incrementing vector-store-halfword command
(vshai). The cross-iteration dependencies are passed from
one stripmine vector to the next through the cross-VP start/stop
queues. At the end of the function the control processor uses the
xvppop command to pop and discard the final values.
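
The structure of the stripmine loop can be modeled in plain C. The
sketch below is a functional analogy only: it replaces the decoder
with a single saturating running sum, executes the VPs of each strip
sequentially, and uses an ordinary variable in place of the cross-VP
start/stop queue. The function names and the lo/hi saturation bounds
are hypothetical.

```c
#include <assert.h>

/* Clamp v to the range [lo, hi]. */
int sat(int v, int lo, int hi)
{
    return v < lo ? lo : (hi < v ? hi : v);
}

/* One strip: vl loop iterations, one per VP. The carry argument plays
 * the role of the value received via prevVP and passed on via nextVP. */
int run_strip(const int *in, int *out, int vl, int carry, int lo, int hi)
{
    for (int vp = 0; vp < vl; vp++) {
        carry = sat(carry + in[vp], lo, hi);
        out[vp] = carry;
    }
    return carry;
}

/* Stripmine loop run by the control processor. */
int stripmine(const int *in, int *out, int len, int vlmax, int lo, int hi)
{
    assert(vlmax > 0);
    int carry = 0;                           /* initial value (xvppush)  */
    while (len > 0) {
        int vl = len < vlmax ? len : vlmax;  /* setvl                    */
        carry = run_strip(in, out, vl, carry, lo, hi);
        in  += vl;                           /* auto-incrementing load   */
        out += vl;                           /* auto-incrementing store  */
        len -= vl;
    }
    return carry;                            /* final value (xvppop)     */
}
```

Because the carried value flows from the last VP of one strip to the
first VP of the next, the result does not depend on vlmax, which is
what lets the same code run on implementations with different
resources.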

The SCALE VP code implements one iteration of the loop in
a straightforward manner with no cross-iteration static scheduling.
Cluster 0 holds the delta input value in pr0 from the previous
vector-load. It uses a VP load to perform the indexTable lookup
and sends the result to cluster 1. Cluster 1 uses five instructions to
accumulate and saturate index, using prevVP and nextVP to
receive and send the cross-iteration value, and the psel
(predicate-select) instruction to optimize the saturation. Cluster 0 then
performs the stepsizeTable lookup using the index value, and
sends the result to cluster 3 where it is multiplied by delta.
Cluster 2 uses five instructions to accumulate and saturate valpred,
writing the result to pr0 for the subsequent vector-store.

3.3  SCALE  Microarchitecture 

The SCALE microarchitecture is an extension of the general VT
physical model shown in Figure 6. A lane has a single CMU and
one physical execution cluster per VP cluster. Each cluster has a
dedicated output bus which broadcasts data to the other clusters in
the lane, and it also connects to its sibling clusters in neighboring
lanes to support cross-VP data transfers. An overview of the
SCALE lane microarchitecture is shown in Figure 10.

Micro-Ops  and  Cluster  Decoupling 

        .sisa                           # SCALE VP code
    vtu_decode_ex:
        .aib begin
        c0  sll  pr0, 2    -> cr1       # word offset
        c0  lw   cr1(sr0)  -> c1/cr0
        ...

Figure 9: SCALE code implementing decoder example from Figure 8.

The SCALE software ISA is portable across multiple SCALE
implementations, but is designed to be easy to translate into
implementation-specific micro-operations, or micro-ops. The assembler
translates the SCALE software ISA into the native hardware
ISA at compile time. There are three categories of hardware
micro-ops: a compute-op performs the main RISC-like operation of
a VP instruction; a transport-op sends data to another cluster; and
a writeback-op receives data sent from an external cluster. The
assembler reorganizes micro-ops derived from an AIB into micro-op
bundles which target a single cluster and do not access other
clusters’ registers. Figure 11 shows how the SCALE VP instructions
from the decoder example are translated into micro-op bundles.
All inter-cluster data dependencies are encoded by the transport-ops
and writeback-ops which are added to the sending and receiving
cluster respectively. This allows the micro-op bundles for each
cluster to be packed together independently from the micro-op
bundles for other clusters.

Figure 10: SCALE Lane Microarchitecture. The AIB caches in SCALE
hold micro-op bundles. The compute-op is a local RISC operation on
the cluster, the transport-op sends data to external clusters, and the
writeback-op receives data from external clusters. Clusters 1, 2, and
3 are basic cluster designs with writeback-op and transport-op
decoupling resources (cluster 1 is shown in detail, clusters 2 and 3
are shown in abstract). Cluster 0 connects to memory and includes
memory access decoupling resources.

Figure 11: Cluster micro-op bundles for the AIB in Figure 9. The
writeback-op field is labeled as ’wb-op’ and the transport-op field is
labeled as ’tp-op’. A writeback-op is marked with ’b’ when the
dependency order is writeback-op followed by compute-op. The prevVP
and nextVP identifiers are abbreviated as ’pVP’ and ’nVP’.

Figure 12: Execution of decoder example on SCALE. Each cluster
executes in-order, but cluster and lane decoupling allows the execution
to automatically adapt to the software critical path. Critical
dependencies are shown with arrows (solid for inter-cluster within a
lane, dotted for cross-VP).

Partitioning inter-cluster data transfers into separate transport
and writeback operations enables decoupled execution between
clusters. In SCALE, a cluster’s AIB cache contains micro-op bundles,
and each cluster has a local execute directive queue and local
control. Each cluster processes its transport-ops in order,
broadcasting result values onto its dedicated output data bus; and each
cluster processes its writeback-ops in order, writing the values from
external clusters to its local registers. The inter-cluster data
dependencies are synchronized with handshake signals which extend
between the clusters, and a transaction only completes when both the
sender and the receiver are ready. Although compute-ops execute
in order, each cluster contains a transport queue to allow execution
to proceed without waiting for external destination clusters to
receive the data, and a writeback queue to allow execution to proceed
without waiting for data from external clusters (until it is needed
by a compute-op). These queues make inter-cluster synchronization
more flexible, and thereby enhance cluster decoupling.
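
The role of a transport queue can be illustrated with a small bounded
FIFO in C. This is a behavioral sketch under assumed parameters: the
queue depth (TQ_DEPTH), the structure, and the function names are not
taken from the SCALE design. A push models a compute-op depositing a
result, and a transfer models one handshake attempt that completes
only when the sender has data and the receiver is ready.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TQ_DEPTH 4  /* hypothetical transport-queue depth */

typedef struct {
    uint32_t data[TQ_DEPTH];
    int head;   /* index of the oldest entry   */
    int count;  /* number of buffered entries  */
} TransportQueue;

/* Sender side: buffer a result so the cluster can keep executing.
 * Returns false when the queue is full and the sender must stall. */
bool tq_push(TransportQueue *q, uint32_t v)
{
    if (q->count == TQ_DEPTH)
        return false;
    q->data[(q->head + q->count) % TQ_DEPTH] = v;
    q->count++;
    return true;
}

/* One handshake attempt: data moves, in order, only when the sender
 * has a buffered value and the receiver signals that it is ready. */
bool tq_transfer(TransportQueue *q, bool receiver_ready, uint32_t *out)
{
    if (q->count == 0 || !receiver_ready)
        return false;
    *out = q->data[q->head];
    q->head = (q->head + 1) % TQ_DEPTH;
    q->count--;
    return true;
}
```

The sending cluster stalls only when the queue fills, so short
mismatches in timing between clusters are absorbed rather than
propagated.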

A  schematic  diagram  of  the  example  decoder  loop  executing  on 
SCALE (extracted from simulation trace output) is shown in Figure 12.
Each cluster executes the vector-fetched AIB for each VP
mapped  to  its  lane,  and  decoupling  allows  each  cluster  to  advance 
to  the  next  VP  independently.  Execution  automatically  adapts  to 
the  software  critical  path  as  each  cluster’s  local  data  dependencies 
resolve.  In  the  example  loop,  the  accumulations  of  index  and 
valpred  must  execute  serially,  but  all  of  the  other  instructions 
are not on the software critical path. Furthermore, the two
accumulations can execute in parallel, so the cross-iteration serialization
penalty  is  paid  only  once.  Each  loop  iteration  (i.e.,  VP)  executes 
over  a  period  of  30  cycles,  but  the  combination  of  multiple  lanes 
and  cluster  decoupling  within  each  lane  leads  to  as  many  as  six 
loop  iterations  executing  simultaneously. 

Memory  Access  Decoupling 

All VP loads and stores execute on cluster 0 (c0), and it is specially
designed to enable access-execute decoupling [11]. Typically, c0
loads data values from memory and sends them to other clusters,
computation is performed on the data, and results are returned to c0
and stored to memory. With basic cluster decoupling, c0 can continue
execution after a load without waiting for the other clusters
to receive the data. Cluster 0 is further enhanced to hide memory
latencies by continuing execution after a load misses in the cache,
and therefore it may retrieve load data from the cache out of order.
However, like other instructions, load operations on cluster 0
use transport-ops to deliver data to other clusters in order, and c0
uses a load data queue to buffer the data and preserve the correct
ordering.

Figure 13: Preliminary floorplan estimate for SCALE prototype. The
prototype contains a scalar control processor, four 32-bit lanes with
four execution clusters each, and 32 KB of cache in an estimated
10 mm² in 0.18 µm technology.

Importantly, when cluster 0 encounters a store, it does not stall to
wait for the data to be ready. Instead it buffers the store operation,
including the store address, in the decoupled store queue until the
store data is available. When a SCALE VP instruction targets the
sd register, the resulting transport-op sends data to the store unit
rather than to c0; thus, the store unit acts as a primary destination
for inter-cluster transport operations and it handles the
writeback-ops for sd. Store decoupling allows a lane’s load stream to slip
ahead of its store stream, but loads for a given VP are not allowed
to bypass previous stores to the same address by the same VP.
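
The bypass restriction amounts to an address check against the
buffered stores. The C sketch below is one possible formulation, not
the SCALE implementation: the queue depth, the exact-address
comparison, and the names StoreQueue, store_enqueue, and
load_may_bypass are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STORE_Q_DEPTH 8  /* hypothetical decoupled store queue depth */

typedef struct {
    uint32_t addr;        /* store address, known when buffered */
    int      vp;          /* VP that issued the store           */
    bool     data_ready;  /* set once the sd data has arrived   */
} PendingStore;

typedef struct {
    PendingStore q[STORE_Q_DEPTH];
    int count;
} StoreQueue;

/* Buffer a store whose data is not yet available.
 * Returns false if the queue is full and cluster 0 must stall. */
bool store_enqueue(StoreQueue *sq, uint32_t addr, int vp)
{
    if (sq->count == STORE_Q_DEPTH)
        return false;
    sq->q[sq->count++] = (PendingStore){ addr, vp, false };
    return true;
}

/* A load may slip ahead of buffered stores unless the same VP has an
 * earlier store to the same address still waiting in the queue. */
bool load_may_bypass(const StoreQueue *sq, uint32_t addr, int vp)
{
    for (int i = 0; i < sq->count; i++)
        if (sq->q[i].vp == vp && sq->q[i].addr == addr)
            return false;
    return true;
}
```

Only a match on both VP and address blocks the load, so loads from
other VPs, or to other addresses, proceed ahead of the store stream.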

Vector-Memory  Accesses 

Vector-memory commands are sent to the clusters as special execute
directives which generate micro-ops instead of reading them
from the AIB cache. For a vector-load, writeback-ops on the destination
cluster receive the load data; and for a vector-store, compute-ops
and transport-ops on the source cluster read and send the store
data. Chaining is provided to allow overlapped execution of
vector-fetched AIBs and vector-memory operations.

The  vector-memory  commands  are  also  sent  to  the  vector- 
memory  unit  which  performs  the  necessary  cache  accesses.  The 
vector-memory unit can only send one address to the cache each cycle,
but it takes advantage of the structured access patterns to load
or  store  multiple  elements  with  each  access.  The  vector-memory 
unit  communicates  load  and  store  data  to  and  from  cluster  0  in  each 
lane  to  reuse  the  buffering  already  provided  for  the  decoupled  VP 
loads  and  stores. 
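
The benefit of multi-element accesses can be estimated with a short C
calculation. It assumes that each cache access returns one aligned
block of at most 16 bytes (the maximum bank access width listed in
Table 1) and that a unit-stride command touches a contiguous byte
range; the alignment requirement and the function names are
assumptions of this sketch rather than details given in the text.

```c
#include <assert.h>
#include <stdint.h>

#define BANK_ACCESS_BYTES 16u  /* maximum bank access width (Table 1) */

/* Cache accesses needed for a unit-stride vector load or store of n
 * elements, counting one access per aligned 16-byte block touched. */
unsigned unit_stride_accesses(uint32_t base, unsigned n, unsigned elem_bytes)
{
    assert(elem_bytes > 0);
    if (n == 0)
        return 0;
    uint32_t first = base / BANK_ACCESS_BYTES;
    uint32_t last  = (base + n * elem_bytes - 1u) / BANK_ACCESS_BYTES;
    return (unsigned)(last - first + 1u);
}

/* Baseline for comparison: one access per element. */
unsigned per_element_accesses(unsigned n)
{
    return n;
}
```

Under these assumptions, a 116-element unit-stride byte load starting
at an aligned address needs 8 accesses instead of 116.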

3.4  Prototype 

We  are  currently  designing  a  prototype  SCALE  processor,  and 
an  initial  floorplan  is  shown  in  Figure  13.  The  prototype  contains  a 
single-issue  MIPS  scalar  control  processor,  four  32-bit  lanes  with 
four  execution  clusters  each,  and  a  32  KB  shared  primary  cache. 
The VTU has 32 registers per cluster and supports up to 128 virtual
processors. The prototype’s unified L1 cache is 32-way
set-associative [15] and divided into four banks. The vector memory
unit can perform a single access per cycle, fetching up to 128 bits
from a single bank, or all lanes can perform VP accesses from different
banks. The cache is non-blocking and connects to off-chip
DDR2  SDRAM. 

The area estimate of around 10 mm² in 0.18 µm technology is
based on microarchitecture-level datapath designs for the control
processor and VTU lanes; cell dimensions based on layout for the
datapath blocks, register files, CAMs, SRAM arrays, and crossbars;
and estimates for the synthesized control logic and external
interface overhead. We have designed the SCALE prototype to
fit into a compact area to reduce wire delays and design complexity,
and to support tiling of multiple SCALE processors on a CMP
for increased processing throughput. The clock frequency target is
400 MHz based on a 25 FO4 cycle time, chosen as a compromise
between performance, power, and design complexity.

    Vector-Thread Unit
      Number of lanes                 4
      Clusters per lane               4
      Registers per cluster           32
      AIB cache uops per cluster      32
      Intra-cluster bypass latency    0 cycles
      Intra-lane transport latency    1 cycle
      Cross-VP transport latency      0 cycles
      Clock frequency                 400 MHz

    L1 Unified Cache
      Size                            32 KB
      Associativity                   32 (CAM tags)
      Line size                       32 B
      Banks                           4
      Maximum bank access width       16 B
      Store miss policy               write-allocate/write-back
      Load-use latency                2 cycles

    Memory System
      DRAM type                       DDR2
      Data bus width                  64 bits
      DRAM clock frequency            200 MHz
      Data bus frequency              400 MHz
      Minimum load-use latency        35 processor cycles

Table 1: Default parameters for SCALE simulations.

4.  Evaluation 

This section contains an evaluation and analysis of SCALE running
a diverse selection of embedded benchmark codes. We first
describe  the  simulation  methodology  and  benchmarks,  then  discuss 
how  the  benchmark  codes  were  mapped  to  the  VT  architecture  and 
the  resulting  efficiency  of  execution. 

4.1  Programming  and  Simulation  Methodology 

SCALE  was  designed  to  be  compiler-friendly,  and  a  C  compiler 
is  under  development.  For  the  results  in  this  paper,  all  VTU  code 
was  hand-written  in  SCALE  assembler  (as  in  Figure  9)  and  linked 
with  C  code  compiled  for  the  MIPS  control  processor  using  gcc. 
The  same  binary  code  was  used  across  all  SCALE  configurations. 

A detailed cycle-level, execution-driven microarchitectural simulator
has been developed based on the prototype design. Details
modeled in the VTU simulation include cluster execution of
micro-ops governed by execute-directives; cluster decoupling and
dynamic inter-cluster data dependency resolution; memory access
decoupling; operation of the vector-memory unit; operation of
the command management unit, including vector-fetch and
thread-fetch commands with AIB tag-checking and miss handling; and the
AIB fill unit and its contention for the primary cache.

The VTU simulation is complemented by a cycle-based memory
system simulation which models the multi-requester, multi-banked,
non-blocking, highly-associative CAM-based cache and
a detailed memory controller and DRAM model. The cache accurately
models bank conflicts between different requesters; exerts
back-pressure in response to cache contention; tracks pending
misses and merges in new requests; and models cache-line refills
and writebacks. The DRAM simulation is based on the DDR2
chips used in the prototype design, and models a 64-bit wide memory
port clocked at 200 MHz (400 Mb/s/pin) including page refresh,
precharge and burst effects.
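
The peak bandwidth implied by these memory-port figures follows from
simple arithmetic, shown here as a C sketch. It takes 1 Mb as 10^6
bits and ignores refresh, precharge, and burst overheads, so it is an
upper bound rather than a simulated result; the function names are
illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Peak bytes per second for a bus of bus_bits pins, each transferring
 * mbits_per_sec_per_pin megabits per second (1 Mb = 10^6 bits). */
uint64_t peak_bytes_per_sec(unsigned bus_bits, uint64_t mbits_per_sec_per_pin)
{
    return (uint64_t)bus_bits * mbits_per_sec_per_pin * 1000000u / 8u;
}

/* Peak bytes delivered per processor cycle at clock rate cpu_hz. */
uint64_t bytes_per_cpu_cycle(uint64_t bytes_per_sec, uint64_t cpu_hz)
{
    assert(cpu_hz > 0);
    return bytes_per_sec / cpu_hz;
}
```

For a 64-bit port at 400 Mb/s/pin this gives 3.2 GB/s, or 8 bytes per
cycle of a 400 MHz processor, so transferring one 32-byte cache line
occupies the data bus for at least four processor cycles.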

The default simulation parameters are based on the prototype and
are summarized in Table 1. An intra-lane transport from one cluster
to another has a latency of one cycle (i.e., there will be a one cycle
bubble between the producing instruction and the dependent
instruction). Cross-VP transports are able to have zero cycle latency
because the clusters are physically closer together and there is less
fan-in for the receive operation. Cache accesses have a two cycle
latency (two bubble cycles between load and use), and accesses
which miss in the cache have a minimum latency of 35 cycles.

To  show  scaling  effects,  we  model  four  SCALE  configurations 
with  one,  two,  four,  and  eight  lanes.  The  one,  two,  and  four 
lane  configurations  each  include  four  cache  banks  and  one  64-bit 
DRAM  port.  For  eight  lanes,  the  memory  system  was  doubled  to 
eight cache banks and two 64-bit memory ports to appropriately
match  the  increased  compute  bandwidth. 

4.2  Benchmarks  and  Results 

We  have  implemented  a  selection  of  benchmarks  (Table  2)  to 
illustrate the key features of SCALE, including examples from network
processing, image processing, cryptography, and audio processing.
The majority of these benchmarks come from the EEMBC
benchmark suite. The EEMBC benchmarks may either be run
“out-of-the-box” (OTB) as compiled unmodified C code, or they may be
optimized (OPT) using assembly coding and arbitrary hand-tuning.
This enables a direct comparison of SCALE running hand-written
assembly code to optimized results from industry processors.
Although OPT results match the typical way in which these processors
are used, one drawback of this form of evaluation is that
performance  depends  greatly  on  programmer  effort,  especially  as 
EEMBC  permits  algorithmic  and  data-structure  changes  to  many 
of  the  benchmark  kernels,  and  optimizations  used  for  the  reported 
results  are  often  unpublished.  Also,  not  all  of  the  EEMBC  results 
are  available  for  all  processors,  as  results  are  often  submitted  for 
only  one  of  the  domain-specific  suites  (e.g.,  telecom). 

We made algorithmic changes to several of the EEMBC benchmarks:
rotate blocks the algorithm to enable rotating an 8-bit
block completely in registers, pktflow implements the packet
descriptor queue using an array instead of a linked list, fir optimizes
the default algorithm to avoid copying and exploit reuse, fbital
uses a binary search to optimize the bit allocation, conven uses
bit packed input data to enable multiple bit-level operations to be
performed in parallel, and fft uses a radix-2 hybrid Stockham
algorithm to eliminate bit-reversal and increase vector lengths.

Figure  14  shows  the  simulated  performance  of  the  various 
SCALE  processor  configurations  relative  to  several  reasonable 
competitors  from  among  those  with  the  best  published  EEMBC 
benchmark scores in each domain. For each of the different benchmarks,
Table 3 shows VP configuration and vector-length statistics,
and  Tables  4  and  5  give  statistics  showing  the  effectiveness  of  the 
SCALE  control  and  data  hierarchies.  These  are  discussed  further 
in  the  following  sections. 

The AMD Au1100 was included to validate the SCALE control
processor OTB performance, as it has a similar structure and
clock frequency, and also uses gcc. The Philips TriMedia TM
1300 is a five-issue VLIW processor with a 32-bit datapath. It runs
at 166 MHz and has a 32 KB L1 instruction cache and 16 KB L1
data cache, with a 32-bit memory port running at 125 MHz. The
Motorola PowerPC (MPC7447) is a four-issue out-of-order superscalar
processor which runs at 1.3 GHz and has 32 KB separate L1
instruction and data caches, a 512 KB L2 cache, and a 64-bit memory
port running at 133 MHz. The OPT results for the processor
use its Altivec SIMD unit which has a 128-bit datapath and four
execution units. The VIRAM processor [4] is a research vector
processor with four 64-bit lanes. VIRAM runs at 200 MHz and includes
13 MB of embedded DRAM supporting up to 256 bits each
of load and store data, and four independent addresses per cycle.
The BOPS Manta is a clustered VLIW DSP with four clusters each
capable of executing up to five instructions per cycle on 64-bit
datapaths. The Manta 2.0 runs at 136 MHz with 128 KB of on-chip
memory connected to a 32-bit memory port running at 136 MHz.
The TI TMS320C6416 is a clustered VLIW DSP with two
clusters each capable of executing up to four instructions per cycle.
It runs at 720 MHz and has a 16 KB instruction cache and a 16 KB
data cache together with 1 MB of on-chip SRAM. The TI has a
64-bit memory interface running at 720 MHz. Apart from the Au1100
and SCALE, all other processors implement SIMD operations on
packed subword values and these are widely exploited throughout
the benchmark set.

Overall, the results show that SCALE can flexibly provide competitive
performance with larger and more complex processors on a
wide range of codes from different domains, and that performance
generally scales well when adding new lanes. The results also
illustrate the large speedups possible when algorithms are extensively
tuned for a highly parallel processor versus compiled from standard
reference code. SCALE results for fft and viterbi are
not as competitive with the DSPs. This is partly due to these being
preliminary versions of the code with further scope for tuning
(e.g., moving the current radix-2 FFT to radix-4 and using
outer-loop vectorization for viterbi) and partly due to the DSPs having
special support for these operations (e.g., complex multiply on
BOPS). We expect SCALE performance to increase significantly
with the addition of subword operations and with improvements to
the microarchitecture driven by these early results.

4.3  Mapping  Parallelism  to  SCALE 

The  SCALE  VT  architecture  allows  software  to  explicitly  en¬ 
code  the  parallelism  and  locality  available  in  an  application.  This 
section  evaluates  the  architecture’s  expressiveness  in  mapping  dif¬ 
ferent  types  of  code. 

Data-Parallel  Loops  with  No  Control  Flow 

Data-parallel  loops  with  no  internal  control  flow,  i.e.  simple  vec- 
torizable  loops,  may  be  ported  to  the  VT  architecture  in  a  similar 
manner  as  other  vector  architectures.  Vector-fetch  commands  en¬ 
code  the  cross-iteration  parallelism  between  blocks  of  instructions, 
while  vector-memory  commands  encode  data  locality  and  enable 
optimized  memory  access.  The  EEMBC  image  processing  bench¬ 
marks  (rgbcmy,  rgbyiq,  hpg)  are  examples  of  streaming  vec- 
torizable  code  for  which  SCALE  is  able  to  achieve  high  perfor¬ 
mance  that  scales  with  the  number  of  lanes  in  the  VTU.  A  4-lane 
SCALE  achieves  performance  competitive  with  VIRAM  for  rg¬ 
byiq  and  rgbcmy  despite  having  half  the  main  memory  band¬ 
width,  primarily  because  VIRAM  is  limited  by  strided  accesses 
while  SCALE  refills  the  cache  with  unit-stride  bursts  and  then  has 
higher  strided  bandwidth  into  the  cache.  For  the  unit-stride  hpg 
benchmark,  performance  follows  memory  bandwidth  with  the  8- 
lane  SCALE  approximately  matching  VIRAM. 

Data-Parallel  Loops  with  Conditionals 

Traditional  vector  machines  handle  conditional  code  with  predica¬ 
tion  (masking),  but  the  VT  architecture  adds  the  ability  to  condi¬ 
tionally  branch.  Predication  can  be  less  overhead  for  small  condi¬ 
tionals,  but  branching  results  in  less  work  when  conditional  blocks 
are  large.  EEMBC  dither  is  an  example  of  a  large  conditional 
block  in  a  data  parallel  loop.  This  benchmark  converts  a  grey-scale 
image  to  black  and  white,  and  the  dithering  algorithm  handles  white 
pixels  as  a  special  case.  In  the  SCALE  code,  each  VP  executes  a 
conditional  fetch  for  each  pixel,  executing  only  18  SCALE  instruc¬ 
tions  for  white  pixels  versus  49  for  non-white  pixels. 
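
The trade-off between predication and conditional fetches can be made concrete with a small instruction-count model. This is an illustrative sketch only. It uses the 18- and 49-instruction path lengths quoted above, and it assumes (the text does not state this) that a predicated version would issue the full 49-instruction path for every pixel. The function names and the pixel encoding (255 for white) are hypothetical.

```python
WHITE = 255  # assumed encoding of a white pixel in an 8-bit grey-scale image


def branching_instruction_count(pixels, white_cost=18, other_cost=49):
    """Instructions executed when each VP fetches only the path it needs."""
    return sum(white_cost if p == WHITE else other_cost for p in pixels)


def predicated_instruction_count(pixels, full_cost=49):
    """Instructions issued if every pixel runs the long path under a mask.

    Assumed model: masked-off instructions still occupy issue slots.
    """
    return full_cost * len(pixels)


if __name__ == "__main__":
    image = [WHITE] * 6 + [40, 200]
    print(branching_instruction_count(image))   # 6*18 + 2*49
    print(predicated_instruction_count(image))  # 8*49
```

Under this model the benefit of branching grows with the fraction of white pixels in the image.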

Table 2: Benchmark Statistics and Characterization. All numbers are for the default SCALE configuration with four lanes. Results for multiple
input data sets are shown separately if there was significant variation, otherwise an all data set indicates results were similar across inputs. As is
standard practice, EEMBC statistics are for the kernel only. Total cycle numbers for non-EEMBC benchmarks are for the entire application, while
the remaining statistics are for the kernel of the benchmark only (the kernel excludes benchmark overhead code and for li the kernel consists of the
garbage collector only). The Mem B/Cycle column shows the DRAM bandwidth in bytes per cycle. The Loop Type column indicates the categories of
loops which were parallelized when mapping the benchmark to SCALE: [DP] data-parallel loop with no control flow, [DC] data-parallel loop with
conditional thread-fetches, [XI] loop with cross-iteration dependencies, [DI] data-parallel loop with inner-loop, [DE] loop with data-dependent exit
condition, and [FT] free-running threads. The Mem column indicates the types of memory accesses performed: [VM] for vector-memory accesses
and [VP] for individual VP loads and stores.

Table 3: VP configuration and vector-length statistics as averages of data recorded at each vector-fetch command. The VP configuration register
counts represent totals across all four clusters, vlmax indicates the average maximum vector length, and vl indicates the average vector length.


Loops  with  Cross-Iteration  Dependencies 

Many  loops  are  non-vectorizable  because  they  contain  loop-carried 
data  dependencies  from  one  iteration  to  the  next.  Nevertheless, 
there  may  be  ample  loop  parallelism  available  when  there  are  oper¬ 
ations  in  the  loop  which  are  not  on  the  critical  path  of  the  cross¬ 
iteration  dependency.  The  vector-thread  architecture  allows  the 
parallelism  to  be  exposed  by  making  the  cross-iteration  (cross- 
VP)  data  transfers  explicit.  In  contrast  to  software  pipelining  for 
a  VLIW  architecture,  the  vector-thread  code  need  only  schedule 
instructions  locally  in  one  loop  iteration.  As  the  code  executes  on 
a  vector-thread  machine,  the  dependencies  between  iterations  re¬ 
solve  dynamically  and  the  performance  automatically  adapts  to  the 
software  critical  path  and  the  available  hardware  resources. 

Mediabench  ADPCM  contains  one  such  loop  (similar  to  Fig¬ 
ure  8)  with  two  loop-carried  dependencies  that  can  propagate  in 
parallel.  The  loop  is  mapped  to  a  single  SCALE  AIB  with  35  VP 
instructions. Cross-iteration dependencies limit the initiation interval
to 5 cycles, yielding a maximum SCALE IPC of 35/5 = 7.
SCALE sustains an average of 6.7 compute-ops per cycle and
achieves a speedup of 7.9x compared to the control processor.
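
The IPC bound quoted for ADPCM follows from dividing the instructions per iteration by the initiation interval. The sketch below restates that arithmetic; the function names are hypothetical and the only inputs are the figures given in the text.

```python
def max_ipc(instructions_per_iteration, initiation_interval):
    """Upper bound on IPC when a new iteration can start every
    `initiation_interval` cycles."""
    return instructions_per_iteration / initiation_interval


def fraction_of_bound(sustained_ops_per_cycle, bound):
    """How close the sustained rate comes to the dependency-limited bound."""
    return sustained_ops_per_cycle / bound


if __name__ == "__main__":
    bound = max_ipc(35, 5)                  # 35 VP instructions, 5-cycle interval
    print(bound)
    print(fraction_of_bound(6.7, bound))    # sustained 6.7 compute-ops per cycle
```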

The two MiBench cryptographic kernels, sha and rijndael,
have many loop-carried dependences. The sha mapping uses 5
cross-VP data transfers, while the rijndael mapping vectorizes
a short four-iteration inner loop. SCALE is able to exploit
instruction-level  parallelism  within  each  iteration  of  these  kernels 
by  using  multiple  clusters,  but,  as  shown  in  Figure  14,  performance 
also  improves  as  more  lanes  are  added. 

Data-Parallel  Loops  with  Inner-Loops 

Often  an  inner  loop  has  little  or  no  available  parallelism,  but 
the  outer  loop  iterations  can  run  concurrently.  For  example,  the 
EEMBC  lookup  code  models  a  router  using  a  Patricia  Trie  to 
perform  IP  Route  Lookup.  The  benchmark  searches  the  trie  for 
each IP address in an input vector, with each lookup chasing pointers
through around 5-12 nodes of the trie. Very little parallelism is
available in each lookup, but many lookups can run simultaneously.

Figure 14: Performance Results: Twenty-two benchmarks illustrate the performance of four SCALE configurations (1 Lane, 2 Lanes, 4 Lanes,
8 Lanes) compared to various industry architectures. Speedup is relative to the SCALE MIPS control processor. The EEMBC benchmarks are
compared in terms of iterations per second, while the non-EEMBC benchmarks are compared in terms of cycles to complete the benchmark kernel.
These numbers correspond to the Kernel Speedup column in Table 2. For benchmarks with multiple input data sets, results for a single representative
data set are shown with the data set name indicated after a forward slash.

In  the  SCALE  implementation,  each  VP  handles  one  IP  lookup 
using  thread-fetches  to  traverse  the  trie.  The  ample  thread  paral¬ 
lelism  keeps  the  lanes  busy  executing  6.3  ops/cycle  by  interleaving 
the  execution  of  multiple  VPs  to  hide  memory  latency.  Vector- 
fetches  provide  an  advantage  over  a  pure  multithreaded  machine  by 
efficiently  distributing  work  to  the  VPs,  avoiding  contention  for  a 
shared  work  queue.  Additionally,  vector-load  commands  optimize 
the  loading  of  IP  addresses  before  the  VP  threads  are  launched. 
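
The structure of this workload (a serial pointer-chasing inner loop inside an outer loop of independent lookups) can be sketched as follows. This is a simplified uncompressed binary trie with longest-prefix matching, not the Patricia trie used by the EEMBC benchmark, and all names are hypothetical. Each call to `lookup` corresponds to the work one VP would perform; the list comprehension in `lookup_all` is the outer loop whose iterations are independent.

```python
class Node:
    """One trie node: two child pointers and an optional route."""
    __slots__ = ("children", "route")

    def __init__(self):
        self.children = [None, None]
        self.route = None


def insert(root, prefix_bits, route):
    """Add a route for the given prefix (a sequence of 0/1 bits)."""
    node = root
    for bit in prefix_bits:
        if node.children[bit] is None:
            node.children[bit] = Node()
        node = node.children[bit]
    node.route = route


def lookup(root, addr_bits):
    """Serial inner loop: follow one pointer per address bit and
    remember the longest matching prefix seen so far."""
    node, best = root, root.route
    for bit in addr_bits:
        node = node.children[bit]
        if node is None:
            break
        if node.route is not None:
            best = node.route
    return best


def lookup_all(root, addresses):
    """Outer loop: every lookup is independent of the others."""
    return [lookup(root, a) for a in addresses]
```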

Reductions  and  Data-Dependent  Loop  Exit  Conditions 

SCALE  provides  efficient  support  for  arbitrary  reduction  opera¬ 
tions  by  using  shared  registers  to  accumulate  partial  reduction  re¬ 
sults  from  multiple  VPs  on  each  lane.  The  shared  registers  are  then 
combined  across  all  lanes  at  the  end  of  the  loop  using  the  cross-VP 
network. The pktflow code uses reductions to count the number
of  packets  processed. 
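
The two-step reduction described above (per-lane accumulation, then a cross-lane combine) can be modeled in a few lines. This is a sequential sketch under stated assumptions: VPs are taken to be striped across lanes round-robin, the operator is assumed to be associative and commutative, and the function and parameter names are hypothetical.

```python
def lane_reduce(values, num_lanes=4, op=lambda a, b: a + b, identity=0):
    """Reduce `values` using one shared accumulator per lane."""
    shared = [identity] * num_lanes        # one shared register per lane
    for vp, value in enumerate(values):
        lane = vp % num_lanes              # assumed VP-to-lane striping
        shared[lane] = op(shared[lane], value)

    # End of loop: combine the per-lane partial results across lanes.
    result = identity
    for partial in shared:
        result = op(result, partial)
    return result


if __name__ == "__main__":
    print(lane_reduce(range(10)))  # sum of 0..9
    print(lane_reduce([3, 1, 4, 1, 5], op=max, identity=float("-inf")))
```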

Loops  with  data-dependent  exit  conditions  (“while”  loops)  are 
difficult  to  parallelize  because  the  number  of  iterations  is  not  known 
in advance. For example, the strcmp and strcpy standard C library
routines used in the text benchmark loop until the string
termination character is seen. The cross-VP network can be used
to  communicate  exit  status  across  VPs  but  this  serializes  execution. 
Alternatively,  iterations  can  be  executed  speculatively  in  parallel 
and  then  nullified  after  the  correct  exit  iteration  is  determined.  The 
check to find the exit condition is coded as a cross-iteration reduction
operation. The text benchmark executes most of its code on
the control processor, but uses this technique for the string routines
to attain a 1.5x overall speedup.
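
The speculate-then-nullify approach can be illustrated with a block-wise string copy. In this sketch, every element of a block is tested as if by a separate VP, a minimum reduction finds the earliest exit iteration, and only iterations up to that point commit their results. The function name and the block size `vl` are hypothetical, and the code is a sequential model rather than SCALE code.

```python
def speculative_strcpy(src, vl=8):
    """Copy bytes from `src` up to and including the first NUL byte,
    examining `vl` elements per step."""
    dst = bytearray()
    for base in range(0, len(src), vl):
        block = src[base:base + vl]
        # Each "VP" speculatively tests its own element.
        is_end = [b == 0 for b in block]
        # Reduction across iterations: index of the earliest terminator.
        exit_at = min((i for i, e in enumerate(is_end) if e), default=None)
        # Commit only iterations up to the exit; later ones are nullified.
        keep = len(block) if exit_at is None else exit_at + 1
        dst.extend(block[:keep])
        if exit_at is not None:
            return bytes(dst)
    raise ValueError("source is not NUL-terminated")
```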

Free-Running  Threads 

When  structured  loop  parallelism  is  not  available,  VPs  can  be  used 
to exploit arbitrary thread parallelism. With free-running threads,
the control processor interaction is eliminated. Each VP thread runs
in  a  continuous  loop  getting  tasks  from  a  work-queue  accessed  us¬ 
ing  atomic  memory  operations.  An  advantage  of  this  method  is  that 
it  achieves  good  load-balancing  between  the  VPs  and  can  keep  the 
VTU  constantly  utilized. 
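
The free-running pattern can be sketched with ordinary software threads. In the model below, a lock-protected counter stands in for the atomic memory operation used to claim the next task; that substitution, the function name, and the task representation (zero-argument callables) are all assumptions made for illustration.

```python
import threading


def run_free_running(tasks, num_vps=4):
    """Run `tasks` on `num_vps` worker threads that each loop,
    claiming work from a shared queue until it is empty."""
    next_task = 0
    lock = threading.Lock()          # stands in for an atomic fetch-and-add
    results = [None] * len(tasks)

    def vp_thread():
        nonlocal next_task
        while True:
            with lock:               # atomically claim the next task index
                idx = next_task
                next_task += 1
            if idx >= len(tasks):
                return               # queue drained: thread terminates
            results[idx] = tasks[idx]()

    threads = [threading.Thread(target=vp_thread) for _ in range(num_vps)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each thread claims a new task as soon as it finishes the previous one, long and short tasks balance across threads without any central scheduler.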

Three  benchmarks  were  mapped  with  free-running  threads.  The 
pntrch benchmark searches for tokens in a doubly-linked list, and
allows  up  to  five  searches  to  execute  in  parallel.  The  qsort  bench¬ 
mark  uses  quick-sort  to  alphabetize  a  list  of  words.  The  SCALE 
mapping  recursively  divides  the  input  set  and  assigns  VP  threads 
to  sort  partitions,  using  VP  function  calls  to  implement  the  com¬ 
pare  routine.  The  benchmark  achieves  2.3  ops/cycle  despite  a  high 
cache  miss  rate.  The  ospf  benchmark  has  little  available  paral¬ 
lelism  and  the  SCALE  implementation  maps  the  code  to  a  single 
VP  to  exploit  ILP  for  a  small  speedup. 

Mixed  Parallelism 

Some  codes  exploit  a  mixture  of  parallelism  types  to  accelerate  per¬ 
formance  and  improve  efficiency.  The  garbage  collection  portion  of 
the  lisp  interpreter  (li)  is  split  into  two  phases:  mark,  which  tra¬ 
verses  a  tree  of  currently  live  lisp  nodes  and  sets  a  flag  bit  in  every 
visited  node,  and  sweep,  which  scans  through  the  array  of  nodes 
and  returns  a  linked  list  containing  all  of  the  unmarked  nodes.  Dur¬ 
ing  mark,  the  SCALE  code  sets  up  a  queue  of  nodes  to  be  pro¬ 
cessed  and  uses  a  stripmine  loop  to  examine  the  nodes,  mark  them, 
and  enqueue  their  children.  In  the  sweep  phase,  VPs  are  assigned 
segments  of  the  allocation  array  and  then  each  construct  a  list  of  un¬ 
marked  nodes  within  their  segment  in  parallel.  Once  the  VP  threads 
terminate,  the  control  processor  vector-fetches  an  AIB  that  stitches 
the individual lists together using cross-VP data transfers, thus producing
the intended structure. Although the garbage collector has
a  high  cache  miss  rate,  the  high  degree  of  parallelism  exposed  in 
this  way  allows  SCALE  to  sustain  2.8  operations/cycle  and  attain  a 
speedup of 5.5 over the control processor alone.
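
The sweep phase (per-segment list construction followed by stitching) can be sketched as below. This is a sequential model with hypothetical names: `marked` is assumed to be a list of booleans indexed by node, each segment's list would be built by a separate VP on SCALE, and the free list is represented as a list of node indices rather than a linked structure.

```python
def parallel_sweep(marked, num_vps=4):
    """Return the indices of unmarked nodes, built segment by segment."""
    n = len(marked)
    seg = -(-n // num_vps)  # ceiling division: nodes per VP segment

    # Phase 1: each VP scans its own segment independently.
    partial_lists = []
    for vp in range(num_vps):
        lo, hi = vp * seg, min((vp + 1) * seg, n)
        partial_lists.append([i for i in range(lo, hi) if not marked[i]])

    # Phase 2: stitch the per-VP lists together in VP order.
    free_list = []
    for part in partial_lists:
        free_list.extend(part)
    return free_list
```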

4.4  Locality  and  Efficiency 

The  strength  of  the  SCALE  VT  architecture  is  its  ability  to  cap¬ 
ture  a  wide  variety  of  parallelism  in  applications  while  using  simple 
microarchitectural  mechanisms  that  exploit  locality  in  both  control 
and  data  hierarchies. 

A VT machine amortizes control overhead by exploiting the locality
exposed by AIBs and vector-fetch commands, and by factoring
out common control code to run on the control processor. A
vector-fetch  broadcasts  an  AIB  address  to  all  lanes  and  each  lane 
performs  a  single  tag-check  to  determine  if  the  AIB  is  cached.  On 
a  hit,  an  execute  directive  is  sent  to  the  clusters  which  then  retrieve 
the  instructions  within  the  AIB  using  a  short  (5-bit)  index  into  the 
small  AIB  cache.  The  cost  of  each  instruction  fetch  is  on  par  with 
a  register  file  read.  On  an  AIB  miss,  a  vector-fetch  will  broadcast 
AIBs  to  refill  all  lanes  simultaneously.  The  vector-fetch  ensures  an 
AIB  will  be  reused  by  each  VP  in  a  lane  before  any  eviction  is  pos¬ 
sible.  When  an  AIB  contains  only  a  single  instruction  on  a  cluster, 
a  vector-fetch  will  keep  the  ALU  control  lines  fixed  while  each  VP 
executes  its  operation,  further  reducing  control  energy. 

As an example of amortizing control overhead, rgbyiq runs on
SCALE  with  a  vector-length  of  120  and  vector-fetches  an  AIB  with 
29  VP  instructions.  Thus,  each  vector-fetch  executes  3,480  instruc¬ 
tions  on  the  VTU,  870  instructions  per  tag-check  in  each  lane.  This 
is an extreme example, but vector-fetches commonly execute 10s-100s
of instructions per tag-check even for non-vectorizable loops
such as adpcm (Table 4).
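
The amortization figures above follow from a simple product. The sketch below restates that arithmetic under the assumption, taken from the text, that each lane performs one tag-check per vector-fetch; the function name is hypothetical.

```python
def instructions_per_vector_fetch(vector_length, aib_instructions, num_lanes=4):
    """Return (total VTU instructions, instructions per tag-check)
    for a single vector-fetch."""
    total = vector_length * aib_instructions
    per_tag_check = total / num_lanes   # one tag-check per lane
    return total, per_tag_check


if __name__ == "__main__":
    # Vector length 120, AIB of 29 VP instructions, four lanes.
    print(instructions_per_vector_fetch(120, 29))
```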

AIBs  also  help  in  the  data  hierarchy  by  allowing  the  use  of  chain 
registers,  which  reduces  register  file  energy;  and  sharing  of  tem¬ 
porary  registers,  which  reduces  the  register  file  size  needed  for  a 
large number of VPs. Table 5 shows that chain registers comprise
around 32% of all register sources and 44% of all register destinations.
Table 3 shows that across all benchmarks, VP configurations
use  an  average  of  8.5  shared  and  6.2  private  registers,  with  an  av¬ 
erage  maximum  vector  length  above  64  (16  VPs  per  lane).  The 
significant  variability  in  register  requirements  for  different  kernels 
stresses  the  importance  of  allowing  software  to  configure  VPs  with 
just  enough  of  each  register  type. 

Vector-memory  commands  enforce  spatial  locality  by  moving 
data  between  memory  and  the  VP  registers  in  groups.  This  im¬ 
proves  performance  and  saves  memory  system  energy  by  avoid¬ 
ing  the  additional  arbitration,  tag-checks,  and  bank  conflicts  that 
would  occur  if  each  VP  requested  elements  individually.  Table  5 
shows  the  reduction  in  memory  addresses  from  vector-memory 
commands.  The  maximum  improvement  is  a  factor  of  four,  when 
each  vector  cache  access  loads  or  stores  one  element  per  lane.  The 
VT  architecture  can  exploit  memory  data-parallelism  even  in  loops 
with  non-data-parallel  compute.  For  example,  the  fbital,  text, 
and adpcm.enc benchmarks use vector-memory commands to access
data for vector-fetched AIBs with cross-VP dependencies.

Table 5 shows that the SCALE data cache is effective at reducing
DRAM bandwidth for most of the benchmarks. Two exceptions
are the pktflow and li benchmarks for which the DRAM
bytes  transferred  exceed  the  total  bytes  accessed.  The  current  de¬ 
sign  always  transfers  32-byte  lines  on  misses,  but  support  for  non¬ 
allocating  loads  and  stores  could  help  reduce  the  bandwidth  for 
these  benchmarks. 

Clustering  in  SCALE  is  area  and  energy  efficient  and  cluster  de¬ 
coupling  improves  performance.  The  clusters  each  contain  only 
a  subset  of  all  possible  functional  units  and  a  small  register  file 
with  few  ports,  reducing  size  and  wiring  energy.  Each  cluster  ex¬ 
ecutes  compute-ops  and  inter-cluster  transport  operations  in  order, 
requiring only simple interlock logic with no inter-thread arbitration
or dynamic inter-cluster bypass detection. Independent control
on  each  cluster  enables  decoupled  cluster  execution  to  hide  large 
inter-cluster  or  memory  latencies.  This  provides  a  very  cheap  form 
of  SMT  where  each  cluster  can  be  executing  code  for  different  VPs 
on  the  same  cycle  (Figure  12). 

5.  Related  Work 

The  VT  architecture  draws  from  earlier  vector  architectures  [9], 
and  like  vector  microprocessors  [14,  6,  3]  the  SCALE  VT  imple¬ 
mentation  provides  high  throughput  at  low  complexity.  Similar  to 
CODE  [5],  SCALE  uses  decoupled  clusters  to  simplify  chaining 
control and to reduce the cost of a large vector register file supporting
many functional units. However, whereas CODE uses register
renaming to hide clusters from software, SCALE reduces hardware
complexity by exposing clustering and statically partitioning inter-cluster
transport and writeback operations.

The  Imagine  [8]  stream  processor  is  similar  to  vector  machines, 
with  the  main  enhancement  being  the  addition  of  stream  load  and 
store instructions that pack and unpack arrays of multi-field records
stored in DRAM into multiple vector registers, one per field. In
comparison, SCALE uses a conventional cache to enable unit-stride
transfers from DRAM, and provides segment vector-memory
commands to transfer arrays of multi-field records between the
cache and VP registers. Like SCALE, Imagine improves register
file locality compared with traditional vector machines by executing
all operations for one loop iteration before moving to the next.
However,  Imagine  instructions  use  a  low-level  VLIW  ISA  that  ex¬ 
poses  machine  details  such  as  number  of  physical  registers  and 
lanes,  whereas  SCALE  provides  a  higher-level  abstraction  based 
on  VPs  and  AIBs. 

VT  enhances  the  traditional  vector  model  to  support  loops  with 
cross-iteration  dependencies  and  arbitrary  internal  control  flow. 
Chiueh’s  multi-threaded  vectorization  [1]  extends  a  vector  ma¬ 
chine  to  handle  loop-carried  dependencies,  but  is  limited  to  a  sin¬ 
gle  lane  and  requires  the  compiler  to  have  detailed  knowledge  of 
all  functional  unit  latencies.  Jesshope’s  micro-threading  [2]  uses 
a  vector-fetch  to  launch  micro-threads  which  each  execute  one 
loop  iteration,  but  whose  execution  is  dynamically  scheduled  on 
a  per-instruction  basis.  In  contrast  to  VT’s  low-overhead  direct 
cross-VP  data  transfers,  cross-iteration  synchronization  is  done  us¬ 
ing  full/empty  bits  on  shared  global  registers.  Like  VT,  Multi¬ 
scalar  [12]  statically  determines  loop-carried  register  dependencies 
and  uses  a  ring  to  pass  cross-iteration  values.  But  Multiscalar  uses 
speculative  execution  with  dynamic  checks  for  memory  dependen¬ 
cies,  while  VT  dispatches  multiple  non- speculative  iterations  si¬ 
multaneously.  Multiscalar  can  execute  a  wider  range  of  loops  in 
parallel,  but  VT  can  execute  many  common  parallel  loop  types  with 
much  simpler  logic  and  while  using  vector-memory  operations. 

Several  other  projects  are  developing  processors  capable  of  ex¬ 
ploiting  multiple  forms  of  application  parallelism.  The  Raw  [13] 
project  connects  a  tiled  array  of  simple  processors.  In  contrast  to 
SCALE's direct inter-cluster data transfers and cluster decoupling,
inter-tile  communication  on  Raw  is  controlled  by  programmed 
switch  processors  and  must  be  statically  scheduled  to  tolerate  laten¬ 
cies.  The  Smart  Memories  [7]  project  has  developed  an  architecture 
with configurable processing tiles which support different types of
parallelism, but it has different instruction sets for each type and
requires a reconfiguration step to switch modes. The TRIPS processor
[10] similarly must explicitly morph between instruction,
thread, and data parallelism modes. These mode switches limit the
ability to exploit multiple forms of parallelism at a fine grain, in
contrast  to  SCALE  which  seamlessly  combines  vector  and  threaded 
execution  while  also  exploiting  local  instruction-level  parallelism. 


6.  Conclusion 

The  vector-thread  architectural  paradigm  allows  software  to 
more  efficiently  encode  the  parallelism  and  locality  present  in  many 
applications,  while  the  structure  provided  in  the  hardware-software 
interface enables high-performance implementations that are efficient
in area and power. The VT architecture unifies support for all
types  of  parallelism  and  this  flexibility  enables  new  ways  of  paral¬ 
lelizing  codes,  for  example,  by  allowing  vector-memory  operations 
to  feed  directly  into  threaded  code.  The  SCALE  prototype  demon¬ 
strates  that  the  VT  paradigm  is  well-suited  to  embedded  applica¬ 
tions,  allowing  a  single  relatively  small  design  to  provide  competi¬ 
tive  performance  across  a  range  of  application  domains.  Although 
this  paper  has  focused  on  applying  VT  to  the  embedded  domain, 
we  anticipate  that  the  vector-thread  model  will  be  widely  applicable 
in other domains including scientific computing, high-performance
graphics  processing,  and  machine  learning. 
Abstract


Wireless  transmission  of  a  bit  can  require  over  1000  times  more  energy  than  a  single  32-bit  computation.  It  would 
therefore  seem  desirable  to  perform  significant  computation  to  reduce  the  number  of  bits  transmitted.  If  the  energy 
required  to  compress  data  is  less  than  the  energy  required  to  send  it,  there  is  a  net  energy  savings  and  consequently, 
a  longer  battery  life  for  portable  computers.  This  paper  reports  on  the  energy  of  lossless  data  compressors  as  mea¬ 
sured  on  a  StrongARM  SA-110  system.  We  show  that  with  several  typical  compression  tools,  there  is  a  net  energy 
increase  when  compression  is  applied  before  transmission.  Reasons  for  this  increase  are  explained,  and  hardware- 
aware  programming  optimizations  are  demonstrated.  When  applied  to  Unix  compress,  these  optimizations  improve 
energy  efficiency  by  51%.  We  also  explore  the  fact  that,  for  many  usage  models,  compression  and  decompression 
need  not  be  performed  by  the  same  algorithm.  By  choosing  the  lowest-energy  compressor  and  decompressor  on  the 
test  platform,  rather  than  using  default  levels  of  compression,  overall  energy  to  send  compressible  web  data  can  be 
reduced  31%.  Energy  to  send  harder-to-compress  English  text  can  be  reduced  57%.  Compared  with  a  system  using  a 
single optimized application for both compression and decompression, the asymmetric scheme saves 11% or 12% of
the  total  energy  depending  on  the  dataset. 


1  Introduction 

Wireless  communication  is  an  essential  component  of 
mobile  computing,  but  the  energy  required  for  transmis¬ 
sion  of  a  single  bit  has  been  measured  to  be  over  1000 
times  greater  than  a  single  32-bit  computation.  Thus,  if 
1000  computation  operations  can  compress  data  by  even 
one  bit,  energy  should  be  saved.  However,  accessing 
memory  can  be  over  200  times  more  costly  than  compu¬ 
tation  on  our  test  platform,  and  it  is  memory  access  that 
dominates  most  lossless  data  compression  algorithms.  In 
fact,  even  moderate  compression  (e.g.  gzip  -6)  can 
require  so  many  memory  accesses  that  one  observes  an 
increase  in  the  overall  energy  required  to  send  certain 
data. 

While  some  types  of  data  (e.g.,  audio  and  video)  may 
accept  some  degradation  in  quality,  other  data  must  be 
transmitted faithfully with no loss of information. Fidelity
cannot be sacrificed to reduce energy as is done
in related work on lossy compression. Fortunately, an
understanding  of  a  program’s  behavior  and  the  energy 
required  by  major  hardware  components  can  be  used  to 
reduce energy. The ability to perform efficient
lossless compression also provides second-order benefits
such as reduction in packet loss and less contention for
the fixed wireless bandwidth. Concretely, if n bits have
been compressed to m bits (n > m); c is the cost of
compression and decompression; and w is the cost per
bit of transmission and reception; compression is energy
efficient if $\frac{c}{n-m} < w$. This paper examines the elements
of this inequality and their relationships.
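
The break-even condition can be checked numerically. The sketch below assumes the inequality takes the form c/(n - m) < w, which follows from requiring the compression and decompression cost c to be smaller than the transmission energy w(n - m) saved by sending fewer bits. The function names and the example figures are hypothetical; c and w may be in any consistent energy unit.

```python
def compression_saves_energy(n_bits, m_bits, c, w):
    """True if compressing n_bits down to m_bits costs less energy
    than sending the (n_bits - m_bits) bits that were removed."""
    if n_bits <= m_bits:
        raise ValueError("compression must reduce the bit count (n > m)")
    return c / (n_bits - m_bits) < w


def net_energy_saved(n_bits, m_bits, c, w):
    """Energy saved by compressing before sending; negative means a loss."""
    return w * (n_bits - m_bits) - c


if __name__ == "__main__":
    # Made-up figures: 8000 bits compressed to 4000 bits,
    # c = 2,000,000 units, w = 1000 units per bit.
    print(compression_saves_energy(8000, 4000, 2_000_000, 1000))
    print(net_energy_saved(8000, 4000, 2_000_000, 1000))
```

With the same cost c but a weaker compression ratio (for example 8000 bits to 7000 bits), the per-bit cost of compression rises to 2000 units and the condition no longer holds.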

We  measure  the  energy  requirements  of  several  loss¬ 
less  data  compression  schemes  using  the  “Skiff”  plat¬ 
form  developed  by  Compaq  Cambridge  Research  Labs. 
The  Skiff  is  a  StrongARM-based  system  designed  with 
energy  measurement  in  mind.  Energy  usage  for  CPU, 
memory,  network  card,  and  peripherals  can  be  measured 
individually.  The  platform  is  similar  to  the  popular  Com¬ 
paq  iPAQ  handheld  computer,  so  the  results  are  relevant 
to  handheld  hardware  and  developers  of  embedded  soft¬ 
ware.  Several  families  of  compression  algorithms  are  an¬ 
alyzed  and  characterized,  and  it  is  shown  that  carelessly 
applying  compression  prior  to  transmission  may  cause  an 
overall  energy  increase.  Behaviors  and  resource-usage 
patterns  are  highlighted  which  allow  for  energy-efficient 
lossless  compression  of  data  by  applications  or  network 
drivers.  We  focus  on  situations  in  which  the  mixture  of 
high  energy  network  operations  and  low  energy  proces¬ 
sor  operations  can  be  adjusted  so  that  overall  energy  is 
lower.  This  is  possible  even  if  the  number  of  total  opera- 




tions,  or  time  to  complete  them,  increases.  Finally,  a  new 
energy-aware  data  compression  strategy  composed  of  an 
asymmetric  compressor  and  decompressor  is  presented 
and  measured. 

Section  2  describes  the  experimental  setup  including 
equipment,  workloads,  and  the  choice  of  compression 
applications.  Section  3  begins  with  the  measurement 
of  an  encouraging  communication-computation  gap,  but 
shows  that  modern  compression  tools  do  not  exploit 
the low relative energy of computation versus communication.
Factors which limit energy reduction are
presented.  Section  4  applies  an  understanding  of  these 
factors to reduce overall energy of transmission through
hardware-conscious  optimizations  and  asymmetric  com¬ 
pression  choices.  Section  5  discusses  related  work,  and 
Section  6  concludes. 

2  Experimental  setup 

While  simulators  may  be  tuned  to  provide  reason¬ 
ably  accurate  estimations  of  a  particular  system’s  energy, 
observing  real  hardware  ensures  that  complex  interac¬ 
tions  of  components  are  not  overlooked  or  oversimpli¬ 
fied.  This  section  gives  a  brief  description  of  our  hard¬ 
ware  and  software  platform,  the  measurement  methodol¬ 
ogy,  and  benchmarks. 

2.1  Equipment 

The  Compaq  Personal  Server,  codenamed  “Skiff,”  is 
essentially  an  initial,  “spread-out”  version  of  the  Com¬ 
paq  iPAQ  built  for  research  purposes  [13].  Powered  by  a 
233 MHz StrongARM SA-110 [29, 17], the Skiff is computationally
similar to the popular Compaq iPAQ handheld
(an SA-1110 [18] based device). For wireless networking,
we add a five volt Enterasys 802.11b wireless
network  card  (part  number  CSIBD-AA).  The  Skiff  has 
32  MB  of  DRAM,  support  for  the  Universal  Serial  Bus, 
a  RS232  Serial  Port,  Ethernet,  two  Cardbus  sockets,  and 
a  variety  of  general  purpose  I/O.  The  Skiff  PCB  boasts 
separate  power  planes  for  its  CPU,  memory  and  mem¬ 
ory  controller,  and  other  peripherals  allowing  each  to  be 
measured in isolation (Figure 1). With a Cardbus extender
card, one can isolate the power used by a wireless
network card as well. A programmable multimeter and
sense resistor provide a convenient way to examine energy
in an active system with error less than 5% [47].

The Skiff runs ARM/Linux 2.4.2-rmk1-np1-hh2 with
PCMCIA  Card  Services  3.1.24.  The  Skiff  has  only  4  MB 
of  non-volatile  flash  memory  to  contain  a  file  system,  so 
the root filesystem is mounted via NFS using the wired
Ethernet port. For benchmarks which require file system
access, the executable and input dataset are brought into
RAM before timing begins. This is verified by observing
the cessation of traffic on the network once the program
completes loading. I/O is conducted in memory using
a modified SPEC harness [42] to avoid the large cost of
accessing the network filesystem.

Figure 1. Simplified Skiff power schematic

2.2  Benchmarks 

Figure 2 shows the performance of several lossless
data  compression  applications  using  metrics  of  compres¬ 
sion  ratio,  execution  time,  and  static  memory  alloca¬ 
tion.  The  datasets  are  the  first  megabyte  (English  books 
and  a  bibliography)  from  the  Calgary  Corpus  [5]  and 
one  megabyte  of  easily  compressible  web  data  (mostly 
HTML,  Javascript,  and  CSS)  obtained  from  the  home- 
pages  of  the  Internet’s  most  popular  websites  [32,  25]. 
Graphics  were  omitted  as  they  are  usually  in  compressed 
form  already  and  can  be  recognized  by  application-layer 
software  via  their  file  extensions.  Most  popular  reposi¬ 
tories  ([4,  10,  11])  for  comparison  of  data  compression 
do  not  examine  the  memory  footprint  required  for  com¬ 
pression  or  decompression.  Though  static  memory  usage 
may  not  always  reflect  the  size  of  the  application’s  work¬ 
ing  set,  it  is  an  essential  consideration  in  mobile  com¬ 
puting  where  memory  is  a  more  precious  resource.  A 
detailed  look  at  the  memory  used  by  each  application, 
and  its  effect  on  time,  compression  ratio,  and  energy  will 
be  presented  in  Section  3.3. 

Figure 2 confirms that we have chosen an array of applications that
span a range of compression ratios and execution times. Each
application represents a different family of compression algorithms
as noted in Table 1. Consideration was also given to popularity and
documentation, as well as quality, parameterizability, and
portability of the source code. The table includes the default
parameters used with each program. To avoid unduly handicapping any
algorithm, it is important to work with well-implemented code.
Mature applications such as compress, bzip2, and zlib reflect a
series of optimizations that have been applied since their
introduction. While PPMd is an experimental program, it is
effectively an optimization of the Prediction by Partial Match (PPM)
compressors that came before it. LZO represents an approach for
achieving great speed with LZ77. Each of the five applications is
summarized below assuming some familiarity with each algorithm. A
more complete treatment with citations may be found in [36].

Figure 2. Benchmark comparison by traditional metrics

zlib combines LZ77 and Huffman coding to form an algorithm known as
“deflate.” The LZ77 sliding window size and hash table memory size
may be set by the user. LZ77 tries to replace a string of symbols
with a pointer to the longest prefix match previously encountered. A
larger window improves the ability to find such a match. More memory
allows for fewer collisions in the zlib hash table. Users may also
set an “effort” parameter which dictates how hard the compressor
should try to extend matches it finds in its history buffer. zlib is
the library form of the popular gzip utility (the library form was
chosen as it provides more options for trading off memory and
performance). Unless specified, it is configured with parameters
similar to gzip.
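
The three zlib knobs described above (effort, window size, and hash table memory) can be illustrated with a minimal sketch. It uses Python's standard zlib binding rather than the C library used in the study; the helper name `deflate`, the sample input, and the specific low-memory settings are assumptions chosen only for illustration.

```python
import zlib

def deflate(data: bytes, level: int = 6, wbits: int = 15, mem_level: int = 8) -> bytes:
    """Compress `data` with explicit effort, window, and memory settings."""
    # level     : effort, 1 (fast) to 9 (try hardest to extend matches)
    # wbits     : log2 of the LZ77 sliding window size (15 -> 32 KB window)
    # mem_level : 1..9, scales the memory used for internal state such as
    #             the hash table
    compressor = zlib.compressobj(level, zlib.DEFLATED, wbits, mem_level)
    return compressor.compress(data) + compressor.flush()

# Illustrative input only; not one of the datasets used in the study.
sample = b"the quick brown fox jumps over the lazy dog. " * 200

gzip_like = deflate(sample, level=6, wbits=15, mem_level=8)
low_memory = deflate(sample, level=1, wbits=10, mem_level=1)  # 1 KB window

# Either stream can be restored by the same decompressor.
assert zlib.decompress(gzip_like) == sample
assert zlib.decompress(low_memory) == sample
```

Shrinking the window and memory level reduces the compressor's footprint at some cost in compression ratio, which is the tradeoff examined later for the Skiff.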

LZO is a compression library meant for “real-time” compression. Like
zlib, it uses LZ77 with a hash table to perform searches. LZO is
unique in that its hash table can be sized to fit in 16KB of memory
so it can remain in cache. Its small footprint, coding style (it is
written completely with macros to avoid function call overhead), and
ability to read and write data “in-place” without additional copies
make LZO extremely fast. In the interest of speed, its hash table
can only store pointers to 4096 matches, and no effort is made to
find the longest match. Match length and offset are encoded more
simply than in zlib.

compress is a popular Unix utility. It implements the LZW algorithm
with codewords beginning at nine bits. Though a bit is wasted for
each single 8-bit character, once longer strings have been seen,
they may be replaced with short codes. When all nine-bit codes have
been used, the codebook size is doubled and the use of ten-bit codes
begins. This doubling continues until codes are sixteen bits long.
The dictionary becomes static once it is entirely full. Whenever
compress detects decreasing compression ratio, the dictionary is
cleared and the process begins anew. Dictionary entries are stored
in a hash table. Hashing allows an average constant-time access to
any entry, but has the disadvantage of poor spatial locality when
combining multiple entries to form a string. Despite the random
dispersal of codes to the table, common strings may benefit from
temporal locality. To reduce collisions, the table should be
sparsely filled, which results in wasted memory. During
decompression, each table entry may be inserted without collision.
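
The growth of codeword width described above can be sketched as a small LZW encoder. This is an illustrative sketch, not the code of the compress utility: the function name `lzw_compress` is hypothetical, a Python dict stands in for the hash table, and the step that clears the dictionary when the compression ratio drops is omitted.

```python
def lzw_compress(data: bytes, min_bits: int = 9, max_bits: int = 16):
    """Return a list of (code, width_in_bits) pairs for `data`."""
    table = {bytes([i]): i for i in range(256)}  # all single bytes preloaded
    next_code = 256
    width = min_bits
    output = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate              # keep extending the match
            continue
        output.append((table[current], width))
        if next_code < (1 << max_bits):      # dictionary is static once full
            table[candidate] = next_code
            next_code += 1
            if next_code > (1 << width) and width < max_bits:
                width += 1                   # codes of this width are used up
        current = bytes([byte])
    if current:
        output.append((table[current], width))
    return output

codes = lzw_compress(b"abababab")
total_bits = sum(width for _, width in codes)  # 5 codes of 9 bits each
```

On this 8-byte input the encoder emits five nine-bit codes (45 bits for 64 bits of input), showing how repeated strings are replaced by single codes even though each lone character costs an extra bit.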

PPMd is a recent implementation of the PPM algorithm. Windows users
may unknowingly be using PPMd as it is the text compression engine
in the popular WinRAR program. PPM takes advantage of the fact that
the occurrence of a certain symbol can be highly dependent on its
context (the string of symbols which preceded it). The PPM scheme
maintains such context information to estimate the probability of
the next input symbol to appear. An arithmetic coder uses this
stream of probabilities to efficiently code the source. As the model
becomes more accurate, the occurrence of a highly likely symbol
requires fewer bits to encode. Clearly, longer contexts will improve
the probability estimation, but it requires time to amass large
contexts (this is similar to the startup effect in LZ78). To account
for this, “escape symbols” exist to progressively step down to
shorter context lengths. This introduces a trade-off in which
encoding a long series of escape symbols can require more space than
is saved by the use of large contexts. Storing and searching through
each context accounts for the large memory requirements of PPM
schemes. The length of the maximum context can be varied by PPMd,
but defaults to four. When the context tree fills up, PPMd can clear
and start from scratch, freeze the model and continue statically, or
prune sections of the tree until the model fits into memory.

Application (Version)  Source  Algorithm  Notes (defaults)
bzip2 (0.1pl2)         [37]    BWT        RLE->BWT->MTF->RLE->HUFF (900k block size)
compress (4.0)         [21]    LZW        modified Unix Compress based on Spec95 (16 bit codes (maximum))
LZO (1.07)             [33]    LZ77       Favors speed over compression (lzo1x_12; 4K entry hash table uses 16KB)
PPMd (variant I)       [40]    PPM        used in “rar” compressor (Order 4, 10MB memory, restart model)
zlib (1.1.4)           [9]     LZ77       library form of gzip (Chaining level 6 / 32K Window / 32K Hash Table)

Table 1. Compression applications and their algorithms

bzip2 is based on the Burrows-Wheeler Transform (BWT) [8]. The BWT
converts a block S of length n into a pair consisting of a
permutation of S (call it L) and an integer in the interval
[0..n-1]. More important than the details of the transformation is
its effect. The transform collects groups of identical input symbols
such that the probability of finding a symbol s in a region of L is
very high if another instance of s is nearby. Such an L can be
processed with a “move-to-front” coder which will yield a series
consisting of a small alphabet: runs of zeros punctuated with low
numbers which in turn can be processed with a Huffman or Arithmetic
coder. For processing efficiency, long runs can be filtered with a
run length encoder. As block size is increased, compression ratio
improves. Diminishing returns (with English text) do not occur until
block size reaches several tens of megabytes. Unlike the other
algorithms, one could consider BWT to take advantage of symbols
which appear in the “future”, not just those that have passed. bzip2
reads in blocks of data, run-length-encoding them to improve sort
speed. It then applies the BWT and uses a variant of move-to-front
coding to produce a compressible stream. Though the alphabet may be
large, codes are only created for symbols in use. This stream is
run-length encoded to remove any long runs of zeros. Finally,
Huffman encoding is applied. To speed sorting, bzip2 applies a
modified quicksort which has memory requirements over five times the
size of the block.
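
The transform and the move-to-front stage described above can be sketched as follows. This is a naive illustration under stated assumptions, not bzip2's implementation: it sorts whole rotations directly instead of using bzip2's modified quicksort, it omits both run-length stages and the Huffman coder, and the function names are hypothetical.

```python
def bwt(block: bytes):
    """Naive Burrows-Wheeler transform.

    Returns the pair (L, index): L is the last column of the sorted
    rotations of `block`, and index locates the original block among them.
    """
    n = len(block)
    order = sorted(range(n), key=lambda i: block[i:] + block[:i])
    last_column = bytes(block[(i - 1) % n] for i in order)
    return last_column, order.index(0)

def move_to_front(data: bytes):
    """Replace each byte by its position in a recency-ordered alphabet."""
    alphabet = list(range(256))
    output = []
    for value in data:
        position = alphabet.index(value)
        output.append(position)
        alphabet.insert(0, alphabet.pop(position))  # most recent goes first
    return output

L, index = bwt(b"banana")   # identical symbols are grouped together in L
ranks = move_to_front(L)    # repeated symbols become zeros
```

For the block "banana", L is "nnbaaa" and the move-to-front output contains zeros wherever a symbol repeats, which is the property the later run-length and Huffman stages exploit.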

2.3  Performance  and  implementation  concerns 

A compression algorithm may be implemented with many different, yet
reasonable, data structures (including binary tree, splay tree,
trie, hash table, and list) and yield vastly different performance
results [3]. The quality and applicability of the implementation is
as important as the underlying algorithm. This section has presented
implementations from each algorithmic family. By choosing a top
representative in each family, the implementation playing field is
leveled, making it easier to gain insight into the underlying
algorithm and its influence on energy. Nevertheless, it is likely
that each application can be optimized further (Section 4.1 shows
the benefit of optimization) or use a more uniform style of I/O.
Thus, evaluation must focus on inherent patterns rather than making
a direct quantitative comparison.

3 Observed Energy of Communication, Computation, and Compression

In this section, we observe that over 1000 32-bit ADD instructions
can be executed by the Skiff with the same amount of energy it
requires to send a single bit via wireless ethernet. This fact
motivates the investigation of pre-transmission compression of data
to reduce overall energy. Initial experiments reveal that reducing
the number of bits to send does not always reduce the total energy
of the task. This section elaborates on both of these points which
necessitate the in-depth experiments of Section 3.3.

3.1 Raw Communication-to-Computation Energy Ratio

To quantify the gap between wireless communication and computation,
we have measured wireless idle, send, and receive energies on the
Skiff platform. To eliminate competition for wireless bandwidth from
other devices in the lab, we established a dedicated channel and ran
the network in ad-hoc mode consisting of only two wireless nodes. We
streamed UDP packets from one node to the other; UDP was used to
eliminate the effects of waiting for an ACK. This also ensures that
receive tests measure only receive energy and send tests measure
only send energy. This setup is intended to find the minimum network
energy by removing arbitration delay and the energy of TCP overhead
to avoid biasing our results.

With the measured energy of the transmission and the size of the
data file, the energy required to send or receive a bit can be
derived. The results of these network benchmarks appear in Figure 3
and are consistent with other studies [20]. The card is set to its
maximum speed of 11 Mb/s and two tests are conducted. In the first,
the Skiff communicates with a wireless card mere inches away and
achieves 5.70 Mb/sec. In the second, the second node is placed as
far from the Skiff as possible without losing packets. Only
2.85 Mb/sec is achieved. These two cases bound the performance of
our 11 Mb/sec wireless card; typical performance should be somewhere
between them.

Figure 3. Measured communication energy of Enterasys wireless NIC


Next, a microbenchmark is used to determine the minimum energy for
an ADD instruction. We use Linux boot code to bootstrap the
processor; select a cache configuration; and launch assembly code
unencumbered by an operating system. One thousand ADD instructions
are followed by an unconditional branch which repeats them. This
code was chosen and written in assembly language to minimize effects
of the branch. Once the program has been loaded into instruction
cache, the energy used by the processor for a single ADD is 0.86 nJ.

From these initial network and ADD measurements, we can conclude
that sending a single bit is roughly equivalent to performing
485-1267 ADD operations depending on the quality of the network link
($E_{\text{bit}} / E_{\text{ADD}} \approx 485$ or $\approx 1267$,
with $E_{\text{ADD}} = 0.86 \times 10^{-9}$ J). This gap of 2-3
orders of magnitude suggests that much additional effort can be
spent trying to reduce a file’s size before it is sent or received.
But the issue is not so simple.


3.2 Application-Level Communication-to-Computation Energy Ratio

On the Skiff platform, memory, peripherals, and the network card
remain powered on even when they are not active, consuming a fixed
energy overhead. They may even switch when not in use in response to
changes on shared buses. The energy used by these components during
the ADD loop is significant and is shown in Table 2. Once a
task-switching operating system is loaded and other applications vie
for processing time, the communication-to-computation energy ratio
will decrease further. Finally, the applications examined in this
paper are more than a mere series of ADDs; the variety of
instructions (especially Loads and Stores) in compression
applications shrinks the ratio further.


Component      Energy
Network card   0.43 nJ
CPU            0.86 nJ
Mem            1.10 nJ
Periph         4.20 nJ
Total          6.59 nJ

Table 2. Total Energy of an ADD
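
To see how much the ratio shrinks once whole-system energy is charged to each ADD, the arithmetic can be sketched as below. The per-bit energies are not quoted in the text; they are back-computed here from the rounded ratios (485 and 1267) and the 0.86 nJ CPU-only ADD energy, so they are approximations, and all names in the sketch are illustrative.

```python
ADD_CPU_NJ = 0.86     # CPU-only energy of one ADD (Section 3.1)
ADD_TOTAL_NJ = 6.59   # whole-system energy of one ADD (Table 2)
CPU_ONLY_RATIOS = {"best link": 485, "worst link": 1267}

def implied_bit_energy_nj(cpu_only_ratio: float) -> float:
    """Per-bit network energy implied by a bit-to-ADD ratio (approximate)."""
    return cpu_only_ratio * ADD_CPU_NJ

def system_level_ratio(cpu_only_ratio: float) -> float:
    """ADDs per transmitted bit when each ADD costs the Table 2 total."""
    return implied_bit_energy_nj(cpu_only_ratio) / ADD_TOTAL_NJ

for label, ratio in CPU_ONLY_RATIOS.items():
    print(f"{label}: ~{implied_bit_energy_nj(ratio):.0f} nJ per bit, "
          f"~{system_level_ratio(ratio):.0f} whole-system ADDs per bit")
```

Under these assumptions the gap falls from 485-1267 ADDs per bit to roughly 63-165, before operating system overhead and more expensive instructions are even considered.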

The first row of Figures 4 and 5 shows the energy required to
compress our text and web dataset and transmit it via wireless
ethernet. To avoid punishing the benchmarks for the Skiff’s high
power, idle energy has been removed from the peripheral component so
that it represents only the amount of additional energy (due to bus
toggling and arbitration effects) over and above the energy that
would have been consumed by the peripherals remaining idle for the
duration of the application. Idle energy is not removed from the
memory and CPU portions as they are required to be active for the
duration of the application. The network is assumed to consume no
power until it is turned on to send or receive data. The popular
compression applications discussed in Section 2.2 are used with
their default parameters, and the rightmost bar shows the energy of
merely copying the uncompressed data over the network. Along with
energy due to default operation (labeled “bzip2-900,” “compress-16,”
“lzo-16,” “ppmd-10240,” and “zlib-6”), the figures include energy
for several invocations of each application with varying parameters.
bzip2 is run with both the default 900 KB block size and its
smallest 100 KB block. compress is also run at both ends of its
spectrum (12 bit and 16 bit maximum codeword size). LZO runs in just
16 KB of working memory. PPMd uses 10 MB, 1 MB, and 32 KB memory
with the cutoff mechanism for freeing space (as it is faster than
the default “restart” in low-memory configurations). zlib is run in
a configuration similar to gzip. The numeric suffix (9, 6, or 1)
refers to effort level and is analogous to gzip’s command-line
option. These various invocations will be studied in Section 3.3.3.

While most compressors do well with the web data, in several cases
the energy to compress the file approaches or outweighs the energy
to transmit it. This problem is even worse for the
harder-to-compress text data. The second row of Figures 4 and 5
shows the reverse operation: receiving data via wireless ethernet
and decompressing it. The decompression operation is usually less
costly than compression in terms of energy, a fact which will be
helpful in choosing a low-energy, asymmetric, lossless compression
scheme. As an aside, we have seen that as transmission speed
increases, the value of reducing wireless energy through data
compression is less. Thus, even when compressing and sending data
appears to require the same energy as sending uncompressed data, it
is beneficial to apply compression for the greater good; more shared
bandwidth will be available to all devices allowing them to send
data faster and with less energy. Section 3.3 will discuss how such
high net energy is possible despite the motivating observations.

Figure 4. Energy required to transmit 1MB compressible text data
(panels: Compress + Send and Receive + Decompress, each at
2.85Mb/sec and 5.70Mb/sec; horizontal axis: Application)

Figure 5. Energy required to transmit 1MB compressible web data
(panels: Compress + Send and Receive + Decompress, each at
2.85Mb/sec and 5.70Mb/sec; horizontal axis: Application)

3.3  Energy  analysis  of  popular  compressors 

We will look deeper into the applications to discover why they
cannot exploit the communication-computation energy gap. To perform
this analysis, we rely on empirical observations on the Skiff
platform as well as the execution-driven simulator known as
SimpleScalar [7]. Though SimpleScalar is inherently an out-of-order,
superscalar simulator, it has been modified to read statically
linked ARM binaries and model the five-stage, in-order pipeline of
the SA-110x [2]. As SimpleScalar is beta software we will handle the
statistics it reports with caution, using them to explain the traits
of the compression applications rather than to describe their
precise execution on a Skiff. Namely, high instruction counts and
high cost of memory access lead to poor energy efficiency.

3.3.1  Instruction  count 

We begin by looking at the number of instructions each application
requires to remove and restore a bit (Table 3). The range of
instruction counts is one empirical indication of the applications’
varying complexity. The excellent performance of LZO is due in part
to its implementation as a single function, thus there is no
function call overhead. In addition, LZO avoids superfluous copying
due to buffering (in contrast with compress and zlib). As we will
see, the number of memory accesses plays a large role in determining
the speed and energy of an application. Each program contains
roughly the same percentage of loads and stores, but the great
difference in dynamic number of instructions means that programs
such as bzip2 and PPMd (each executing over 1 billion instructions)
execute more total instructions and therefore have the most memory
traffic.

3.3.2  Memory  hierarchy 

One  noticeable  similarity  of  the  bars  in  Figures  4  and  5  is 
that  the  memory  requires  more  energy  than  the  processor. 
To  pinpoint  the  reason  for  this,  microbenchmarks  were 
run  on  the  Skiff  memory  system. 


The SA-110 data cache is 16 KB. It has 32-way associativity and 16
sets. Each block is 32 bytes. Data is evicted at half-block
granularity and moves to a 16 entry-by-16 byte write buffer. The
write buffer also collects stores that miss in the cache (the cache
is writeback/non-write-allocate). The store buffer can merge stores
to the same entry.

The hit benchmark accesses the same location in memory in an
infinite loop. The miss benchmark consecutively accesses the entire
cache with a 32 byte stride followed by the same access pattern
offset by 16 KB. Writebacks are measured with a similar pattern, but
each load is followed by a store to the same location that dirties
the block, forcing a writeback the next time that location is read.
Store hit energy is subtracted from the writeback energy. The output
of the compiler is examined to ensure the correct number of load or
store instructions is generated. Address generation instructions are
ignored for miss benchmarks as their energy is minimal compared to
that of a memory access. When measuring store misses in this fashion
(with a 32 byte stride), the worst-case behavior of the SA-110’s
store buffer is exposed as no writes can be combined. In the best
case, misses to the same buffered region can have energy similar to
a store hit, but in practice, the majority of store misses for the
compression applications are unable to take advantage of batching
writes in the store buffer.
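
The access patterns of the hit and miss benchmarks can be checked against the cache geometry with a small simulation. This is a sketch under stated assumptions: the set index is taken to be the block number modulo the number of sets, replacement within a set is modeled as first-in first-out, and the write buffer is not modeled. None of these details are specified in the text, and the class and function names are illustrative.

```python
from collections import deque

CACHE_BYTES = 16 * 1024
BLOCK_BYTES = 32
WAYS = 32
SETS = CACHE_BYTES // (BLOCK_BYTES * WAYS)  # 16 sets, matching the text

class DataCache:
    """Set-associative cache of block numbers with FIFO replacement (assumed)."""

    def __init__(self):
        self.sets = [deque(maxlen=WAYS) for _ in range(SETS)]

    def access(self, address: int) -> bool:
        """Return True on a hit; on a miss, load the block and return False."""
        block = address // BLOCK_BYTES
        lines = self.sets[block % SETS]   # assumed index function
        if block in lines:
            return True
        lines.append(block)               # a full deque drops its oldest block
        return False

def hit_stream(count: int):
    """The hit benchmark: the same location over and over."""
    return [0] * count

def miss_stream(passes: int):
    """The miss benchmark: sweep 16 KB at a 32-byte stride, then the same
    sweep offset by 16 KB, repeated `passes` times."""
    addresses = []
    for _ in range(passes):
        for base in (0, CACHE_BYTES):
            addresses.extend(range(base, base + CACHE_BYTES, BLOCK_BYTES))
    return addresses

def count_hits(addresses) -> int:
    cache = DataCache()
    return sum(cache.access(a) for a in addresses)
```

With these assumptions, `count_hits(hit_stream(1000))` is 999 (only the first access misses), while `count_hits(miss_stream(3))` is 0: each sweep evicts everything the previous sweep loaded, so every access in the miss benchmark goes to main memory.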

Table 4 shows that hitting in the cache requires more energy than an
ADD (Table 2), and a cache miss requires up to 145 times the energy
of an ADD. Store misses are less expensive as the SA-110 has a store
buffer to batch accesses to memory. To minimize energy, then, we
must seek to minimize cache misses, which require prolonged access
to higher voltage components.
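
The cost of misses can be made concrete with the Table 4 measurements. The ratios below follow directly from the table; the average-energy function is an added illustration that assumes each load is simply either a hit or a miss with a given probability, which is a simplification not made in the text.

```python
ADD_NJ = 0.86
MEMORY_ENERGY_NJ = {      # measured energies from Table 4
    "load hit": 2.72,
    "load miss": 124.89,
    "writeback": 180.53,
    "store hit": 2.41,
    "store miss": 78.34,
}

# Energy of each memory event expressed in ADD equivalents.
relative_to_add = {op: nj / ADD_NJ for op, nj in MEMORY_ENERGY_NJ.items()}

def average_load_energy_nj(miss_rate: float) -> float:
    """Expected energy per load for a given miss rate (simplified model)."""
    hit = MEMORY_ENERGY_NJ["load hit"]
    miss = MEMORY_ENERGY_NJ["load miss"]
    return (1.0 - miss_rate) * hit + miss_rate * miss

for op, ratio in relative_to_add.items():
    print(f"{op}: {ratio:.1f}x the energy of an ADD")
print(f"5% miss rate: {average_load_energy_nj(0.05):.2f} nJ per load")
```

In this model a miss rate of only 5% more than triples the average energy of a load relative to an all-hit stream (about 8.83 nJ versus 2.72 nJ).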

3.3.3  Minimizing  memory  access  energy 

One way to minimize misses is to reduce the memory requirements of
the application. Figure 6 shows the effect of varying memory size on
compression/decompression time and compression ratio. Looking back
at Figures 4 and 5, we see the energy implications of choosing the
right amount of memory. Most importantly, we see that merely
choosing the fastest or best-compressing application does not result
in lowest overall energy. Table 5 notes the throughput of each
application; we see that with the Skiff’s processor, several
applications have difficulty meeting the line rate of the network
which may preclude their use in latency-critical applications.
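
The line-rate observation can be reproduced from the Table 5 throughputs and the two link rates measured in Section 3.1. Comparing application throughput directly against the link rate is a simplification assumed here for illustration (the compressed stream is smaller than the data read), and the function and variable names are hypothetical.

```python
LINK_RATES_MBPS = (2.85, 5.70)   # achieved link rates from Section 3.1

THROUGHPUT_MBPS = {              # values from Table 5 (Mb/sec)
    ("compress", "text"): {"bzip2": 0.91, "compress": 3.70, "LZO": 24.22,
                           "PPMd": 1.57, "zlib": 0.82},
    ("decompress", "text"): {"bzip2": 2.59, "compress": 11.65, "LZO": 109.44,
                             "PPMd": 1.42, "zlib": 41.15},
    ("compress", "web"): {"bzip2": 0.58, "compress": 4.15, "LZO": 50.05,
                          "PPMd": 2.00, "zlib": 3.29},
    ("decompress", "web"): {"bzip2": 3.25, "compress": 27.43, "LZO": 150.70,
                            "PPMd": 1.75, "zlib": 61.29},
}

def below_line_rate(operation: str, dataset: str, link_mbps: float):
    """Applications whose throughput is lower than the given link rate."""
    rates = THROUGHPUT_MBPS[(operation, dataset)]
    return [name for name, mbps in rates.items() if mbps < link_mbps]

for (operation, dataset) in THROUGHPUT_MBPS:
    for link in LINK_RATES_MBPS:
        slow = below_line_rate(operation, dataset, link)
        print(f"{operation}/{dataset} at {link} Mb/sec: {slow}")
```

Under this comparison, bzip2, PPMd, and zlib cannot compress the text data as fast as even the slower link, while LZO exceeds both link rates in every configuration.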

In the case of compress and bzip2, a larger memory footprint stores
more information about the data and can be used to improve
compression ratio. However, storing more information means less of
the data fits in the cache, leading to more misses, longer runtime,
and hence more energy. This tradeoff need not apply in the case
where more memory allows a more efficient data structure or
algorithm. For example, bzip2 uses a large amount of memory, but for
good reason. While we were able to implement its sort with the
quicksort routine from the standard C library to save significant
memory, the compression takes over 2.5 times as long due to large
constants in the runtime of the more traditional quicksort in the
standard library. This slowdown occurs even when 16 KB block sizes
[38] are used to further reduce memory requirements. Once PPMd has
enough memory to do useful work, more context information can be
stored and less complicated escape handling is necessary.

                                                        bzip2  compress  LZO  PPMd  zlib
Compress: instructions per bit removed (Text data)        116        10    7    76    74
Decompress: instructions per bit restored (Text data)      31         6    2    10     5
Compress: instructions per bit removed (Web data)         284         9    2    60    23
Decompress: instructions per bit restored (Web data)       20         5    1    79     3

Table 3. Instructions per bit

Figure 6. Memory, time, and ratio (Text data). Memory footprint is
indicated by area of circle; footprints shown range from 3KB - 8MB
(panels: Observed data compression performance; Observed data
decompression performance)

             Cycles  Energy (nJ)
Load Hit          1         2.72
Load Miss        80       124.89
Writeback       107       180.53
Store Hit         1         2.41
Store Miss       33        78.34
ADD               1         0.86

Table 4. Measured memory energy vs. ADD energy

The widely scattered performance of zlib, even with similar
footprints, suggests that one must be careful in choosing parameters
for this library to achieve the desired goal (speed or compression
ratio). Increasing window size affects compression; for a given
window, a larger hash table improves speed. Thus, the net effect of
more memory is variable. The choice is especially important if
memory is constrained as certain window/memory combinations are
inefficient for a particular speed or ratio.

The decompression side of the figure underscores the valuable
asymmetry of some of the applications. Often decompressing data is a
simpler operation than compression which requires less memory (as in
bzip2 and zlib). The simple task requires a relatively constant
amount of time as there is less work to do: no sorting for bzip2 and
no searching through a history buffer for zlib, LZO, and compress
because all the information to decompress a file is explicit. The
contrast between compression and decompression for zlib is
especially large. PPM implementations must go through the same
procedure to decompress a file, undoing the arithmetic coding and
building a model to keep its probability counts in sync with the
compressor’s. The arithmetic coder/decoder used in PPMd requires
more time to decode than encode, so decompression requires more
time.

Each of the applications examined allocates fixed-size structures
regardless of the input data length. Thus, in several cases more
memory is set aside than is actually required. However, a large
memory footprint may not be detrimental to an application if its
current working set fits in the cache. The simulator was used to
gather cache statistics. PPM and BWT are known to be quite memory
intensive. Indeed, PPMd and bzip2 access the data cache 1-2 orders
of magnitude more often than the other benchmarks. zlib accesses
data cache almost as much as PPMd and bzip2 during compression, but
drops from 150 million accesses to 8.2 million during decompression.
Though LZ77 is local by nature, the large window and data structures
hurt its cache performance for zlib during the compression phase.
LZO also uses LZ77, but is designed to require just 16KB of memory
and goes to main memory over five times less often than the next
fastest application. The followup to the SA-110 (the SA-1110 used in
Compaq’s iPAQ handheld computer) has only an 8KB data cache which
would exaggerate any penalties observed here. Though large,
low-power caches are becoming possible (the X-Scale has two 32KB
caches), as long as the energy of going to main memory remains so
much higher, we must be concerned with cache misses.

                                          bzip2  compress     LZO  PPMd   zlib
Compress read throughput (Text data)       0.91      3.70   24.22  1.57   0.82
Decompress write throughput (Text data)    2.59     11.65  109.44  1.42  41.15
Compress read throughput (Web data)        0.58      4.15   50.05  2.00   3.29
Decompress write throughput (Web data)     3.25     27.43  150.70  1.75  61.29

Table 5. Application throughputs (Mb/sec)

3.4  Summary 

On the Skiff, compression and decompression energy are roughly
proportional to execution time. We have seen that the Skiff requires
lots of energy to work with aggressively compressed data due to the
amount of high-latency/high-power memory references. However, using
the fastest-running compressor or decompressor is not necessarily
the best choice to minimize total transmission energy. For example,
during decompression both zlib and compress run slower than LZO, but
they receive fewer bits due to better compression so total energy is
less than LZO. These applications successfully walk the tightrope of
computation versus communication cost. Despite the greater energy
needed to decompress the data, the decrease in receive energy makes
the net operation a win. More importantly, we have shown that
reducing energy is not as simple as choosing the fastest or
best-compressing program.

We  can  generalize  the  results  obtained  on  the  Skiff  in 
the  following  fashion.  Memory  energy  is  some  multiple 


of CPU energy. Network energy (send and receive) is a
far greater multiple of CPU energy. It is difficult to predict
how quickly energy of components will change over
time. Even predicting whether a certain component’s energy
usage will grow or shrink can be difficult. Many
researchers envision ad-hoc networks made of nearby
nodes. Such a topology, in which only short-distance
wireless communication is necessary, could reduce the
energy of the network interface relative to the CPU and
memory. On the other hand, for a given mobile CPU design,
planned manufacturing improvements may lower
its relative power and energy. Processors once used only
in desktop computers are being recast as mobile processors.
Though their power may be much larger than that
of the Skiff’s StrongARM, higher clock speeds may reduce
energy. If one subscribes to the belief that CPU energy
will steadily decrease while memory and network
energy remain constant, then bzip2 and PPMd become
viable compressors. If both memory and CPU energy decrease,
then current low-energy compression tools (compress
and LZO) can even be surpassed by their computation
and memory intensive peers. However, if only
network energy decreases while the CPU and memory
systems remain static, energy-conscious systems may
forego compression altogether as it now requires more
energy than transmitting raw data. Thus, it is important
for software developers to be aware of such hardware
effects if they wish to keep compression energy as low
as possible. Awareness of the type of data to be transmitted
is important as well. For example, transmitting
our world-wide-web data required less energy in general
than the text data. Trying to compress pre-compressed
data (not shown) requires significantly more energy and
is usually futile.
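
The scenarios above reduce to one comparison: compression pays off only when the energy to compress plus the energy to send the smaller file is below the energy to send the raw file. The C sketch below is illustrative only. The function name, its parameters, and the linear per-byte energy model are assumptions made for this example, not measurements from the Skiff.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical linear energy model (all names are illustrative):
 *   e_send_per_byte     - Joules to transmit one byte
 *   e_compress_per_byte - Joules of CPU and memory energy per input byte
 *   ratio               - compressed size / original size (0 < ratio <= 1)
 * Returns true when compress-then-send uses less energy than sending raw. */
bool should_compress(size_t bytes, double e_send_per_byte,
                     double e_compress_per_byte, double ratio)
{
    double raw        = (double)bytes * e_send_per_byte;
    double compressed = (double)bytes * e_compress_per_byte
                      + (double)bytes * ratio * e_send_per_byte;
    return compressed < raw;
}
```

Under this model, lowering e_send_per_byte while holding e_compress_per_byte fixed eventually makes the function return false, which corresponds to the case where only network energy decreases.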

4  Results 

We have seen energy can be saved by compressing
files before transmitting them over the network, but
one must be mindful of the energy required to do so.
Compression and decompression energy may be minimized
through wise use of memory (including efficient
data structures and/or sacrificing compression ratio for
cacheability). One must be aware of evolving hardware’s
effect on overall energy. Finally, knowledge of compression
and decompression energy for a given system
permits the use of asymmetric compression in which the
lowest energy application for compression is paired with
the lowest energy application for decompression.

4.1 Understanding cache behavior

Figure 7 shows the compression energy of several
successive optimizations of the compress program. The
baseline implementation is itself an optimization of the
original compress code. The number preceding the dash
refers to the maximum length of codewords. The graph
illustrates the need to be aware of the cache behavior of
an application in order to minimize energy. The data
structure of compress consists of two arrays: a hash table
to store symbols and prefixes, and a code table to
associate codes with hash table indexes. The tables are
initially stored back-to-back in memory. When a new
symbol is read from the input, a single index is used to
retrieve corresponding entries from each array. The
“16-merge” version combines the two tables to form an array
of structs. Thus, the entry from the code table is brought
into the cache when the hash entry is read. The reduction
in energy is negligible: though one type of miss has been
eliminated, the program is actually dominated by a second
type of miss: the probing of the hash table for free
entries. The Skiff data cache is small (16KB) compared
to the size of the hash table (≈270KB), thus the random
indexing into the hash table results in a large number
of misses. A more useful energy and performance optimization
is to make the hash table more sparse. This admits
fewer collisions which results in fewer probes and
thus a smaller number of cache misses. As long as the
extra memory is available to enable this optimization,
about 0.53 Joules are saved compared with applying no
compression at all. This is shown by the “16-sparse” bar
in the figure. The baseline and “16-merge” implementations
require more energy than sending uncompressed
data. A 12-bit version of compress is shown as well.
Even when peripheral overhead energy is disregarded,
it outperforms or ties the 16-bit schemes as its reduced
memory energy due to fewer misses makes up for poorer
compression.

Another way to reduce cache misses is to fit both tables
completely in the cache. Compare the following two
structures:

    struct entry {                  struct entry {
        int fcode;                      signed fcode:20;
        unsigned short code;            unsigned code:12;
    } table[SIZE];                  } table[SIZE];

Each entry stores the same information, but the array
on the left wastes four bytes per entry. Two bytes
are used only to align the short code, and overly-wide
types result in twelve wasted bits in fcode and four bits
wasted in code. Using bitfields, the layout on the right
contains the same information yet fits in half the space.
If the entry were not four bytes, it would need to contain
more members for alignment. Code with such structures
would become more complex as C does not support
arrays of bitfields, but unless the additional code introduces
significant instruction cache misses, the change is
low-impact. A bitwise AND and a shift are all that is
needed to determine the offset into the compact structure.
By allowing the whole table to fit in the cache, the
program with the compacted array has just 56,985 data
cache misses compared with 734,195 in the un-packed
structure; a 0.0026% miss rate versus 0.0288%. The
energy benefit for compress with the compact layout is
negligible because there is so little CPU and memory energy
to eliminate by this technique. The “11-merge” and
“11-compact” bars illustrate the similarity. Nevertheless,
11-compact runs 1.5 times faster due to the reduction in
cache misses, and such a strategy could be applied to
any program which needs to reduce cache misses for performance
and/or energy. Eleven bit codes are necessary
even with the compact layout in order to reduce the size
of the data structure. Despite a dictionary with half the
size, the number of bytes to transmit increases by just
18% compared to “12-merge.” Energy, however, is lower
with the smaller dictionary due to less energy spent in
memory and increased speeds which reduce peripheral
overhead.

Figure 7. Optimizing compress (Text data)

4.2 Exploiting the sleep mode

It has been noted that when a platform has a low-power
idle state, it may be sensible to sacrifice energy
in the short-term in order to complete an application
quickly and enter the low-power idle state [26]. Figure
8 shows the effect of this analysis for compression and
sending of text. Receive/decompression exhibits similar,
but less-pronounced variation for different idle powers.
It is interesting to note that, assuming a low-power
idle mode can be entered once compression is complete,
one’s choice of compression strategies will vary. With its
1 Watt of idle power, the Skiff would benefit most from
zlib compression. A device which used negligible power
when idle would choose the LZO compressor. While
LZO does not compress data the most, it allows the system
to drop into low-power mode as quickly as possible,
using less energy when long idle times exist. For web
data (not shown due to space constraints) the compression
choice is LZO when idle power is low. When idle
power is one Watt, bzip2 is over 25% more energy
efficient than the next best compressor.
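
The comparison behind Figure 8 can be sketched as a fixed-window energy total: the energy spent while compressing and sending, plus idle power for whatever time is left in the window. The function below is a minimal sketch under that reading. Its name and parameters are assumptions for illustration, and no measured values from the paper are embedded in it.

```c
/* Energy consumed over a fixed observation window (e.g. 15 seconds).
 *   active_joules  - energy used while compressing and sending
 *   active_seconds - time that work takes
 *   idle_watts     - power drawn after the work completes
 *   window_seconds - length of the observation window
 * All parameter names are illustrative assumptions. */
double window_energy(double active_joules, double active_seconds,
                     double idle_watts, double window_seconds)
{
    double idle_seconds = window_seconds - active_seconds;
    if (idle_seconds < 0.0)
        idle_seconds = 0.0;   /* the task overruns the window */
    return active_joules + idle_watts * idle_seconds;
}
```

Evaluating this total for each candidate compressor at the platform's idle power is one way to see why the preferred compressor can differ between a device with negligible idle power and one that idles at 1 Watt.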


[Plot title: Total Energy Consumed in 15 Seconds]

Figure 8. Compression + Send energy consumption
with varying sleep power (Text data)


4.3  Asymmetric  compression 

Consider a wireless client similar to the Skiff exchanging
English text with a server. All requests by the
client should be made with its minimal-energy compressor,
and all responses by the server should be compressed
in such a way that they require minimal decompression
energy at the client. Recalling Figures 4 and 5, and recognizing
that the Skiff has no low-power sleep mode, we
choose “compress-12” (the twelve-bit codeword LZW
compressor) for our text compressor as it provides the
lowest total compression energy over all communication
speeds.

To reduce decompression energy, the client can request
data from the server in a format which facilitates
low-energy decompression. If latency is not critical and
the client has a low-power sleep mode, it can even wait
while the server converts data from one compressed format
to another. On the Skiff, zlib is the lowest energy
decompressor for both text and web data. It exhibits the
property that regardless of the effort and memory parameters
used to compress data, the resulting file is quite easy
to decompress. The decompression energy difference between
compress, LZO, and zlib is minor at 5.70 Mb/sec,
but more noticeable at slower speeds.

Figure 9 shows several other combinations of compressor
and decompressor at 5.70 Mb/sec. “zlib-9 + zlib-9”
represents the symmetric pair with the least decompression
energy, but its high compression energy makes
it unlikely to be used as a compressor for devices which
must limit energy usage. “compress-12 + compress-12”
represents the symmetric pair with the least compression
energy. If symmetric compression and decompression
is desired, then this “old-fashioned” Unix compress
program can be quite valuable. Choosing “zlib-1” at
both ends makes sense as well - especially for programs
linked with the zlib library. Compared with the minimum
symmetric compressor-decompressor, asymmetric compression
on the Skiff saves only 11% of energy. However,
modern applications such as ssh and mod_gzip use
“zlib-6” at both ends of the connection. Compared to
this common scheme, the optimal asymmetric pair yields
a 57% energy savings - mostly while performing compression.

It is more difficult to realize a savings over symmetric
zlib-6 for web data as all compressors do a good job
compressing it and “zlib-6” is already quite fast. Nevertheless,
by pairing “lzo” and “zlib-9,” we save 12% of
energy over symmetric “lzo” and 31% over symmetric
“zlib-6.”
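
Choosing an asymmetric pair amounts to two independent minimizations over measured per-tool energies: one for the compress-and-send direction and one for the receive-and-decompress direction. The sketch below assumes a hypothetical `struct codec` table filled in from measurements on the target device. The type, field names, and function names are illustrative, and no values from the paper are included.

```c
#include <stddef.h>

/* One row per compression tool. The energy fields are meant to hold
 * measurements taken on the target device (hypothetical layout). */
struct codec {
    const char *name;
    double compress_joules;    /* energy to compress and send      */
    double decompress_joules;  /* energy to receive and decompress */
};

/* Index of the tool with the lowest compress-and-send energy (n >= 1). */
size_t best_compressor(const struct codec *c, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (c[i].compress_joules < c[best].compress_joules)
            best = i;
    return best;
}

/* Index of the tool with the lowest receive-and-decompress energy. */
size_t best_decompressor(const struct codec *c, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (c[i].decompress_joules < c[best].decompress_joules)
            best = i;
    return best;
}
```

When the two functions return different indices, the client would send with one tool and ask the server to respond in the other tool's format, which is the asymmetric arrangement described above.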

5  Related  work 

This section discusses data compression for low-bandwidth
devices and optimizing algorithms for low
energy. Though much work has gone into these fields
individually, it is difficult to find any which combines
them to examine lossless data compression from an energy
standpoint. Computation-to-communication energy
ratio has been examined before [12], but this work
adds physical energy measurements and applies the results
to lossless data compression.

5.1  Lossless  Data  compression  for 
low-bandwidth  devices 

Like any optimization, compression can be applied at
many points in the hardware-software spectrum. When
applied in hardware, the benefits and costs propagate to
all aspects of the system. Compression in software may
have a more dramatic effect, but for better or worse, its
effects will be less global.

[Plot title: Energy to Send and Receive a compressible 1 MB file.
Axis label: Combination: Compressor Decompressor]

Figure 9. Choosing an optimal compressor-decompressor
pair

The introduction of low-power, portable, low-bandwidth
devices has brought about new (or rediscovered)
uses for data compression. Van Jacobson introduced
TCP/IP Header Compression in RFC 1144 to improve
interactive performance over low-speed (wired) serial
links [19], but it is equally applicable to wireless. By
taking advantage of uniform header structure and self-similarity
over the course of a particular networked conversation,
40 byte headers can be compressed to 3-5
bytes. Three byte headers are the common case. An
all-purpose header compression scheme (not confined
to TCP/IP or any particular protocol) appears in [24].
TCP/IP payloads can be compressed as well with IPComp
[39], but this can be wasted effort if data has already
been compressed at the application layer.

The Low-Bandwidth File System (LBFS) exploits
similarities between the data stored on a client and server,
to exchange only data blocks which differ [31]. Files
are divided into blocks with content-based fingerprint
hashes. Blocks can match any file in the file system
or the client cache; if client and server have matching
block hashes, the data itself need not be transmitted.
Compression is applied before the data is transmitted.
Rsync [44] is a protocol for efficient file transfer
which preceded LBFS. Rather than content-based fingerprints,
Rsync uses its rolling hash function to account for
changes in block size. Block hashes are compared for a
pair of files to quickly identify similarities between client
and server. Rsync block sharing is limited to files of the
same name.

A protocol-independent scheme for text compression,
NCTCSys, is presented in [30]. NCTCSys involves a
common dictionary shared between client and server.
The scheme chooses the best compression method it has
available (or none at all) for a dataset based on parameters
such as file size, line speed, and available bandwidth.

Along with remote proxy servers which may cache or
reformat data for mobile clients, splitting the proxy between
client and server has been proposed to implement
certain types of network traffic reduction for HTTP transactions
[14, 23]. Because the delay required for manipulating
data can be small in comparison with the latency
of the wireless link, bandwidth can be saved with little
effect on user experience. Alternatively, compression
can be built into servers and clients as in the mod_gzip
module available for the Apache Webserver and HTTP
1.1 compliant browsers [16]. Delta encoding, the transmission
of only parts of documents which differ between
client and server, can also be used to compress network
traffic [15, 27, 28, 35].

5.2  Optimizing  algorithms  for  low  energy 

Advanced RISC Machines (ARM) provides an application
note which explains how to write C code in a manner
best-suited for its processors and ISA [1]. Suggestions
include rewriting code to avoid software emulation
and working with 32 bit quantities whenever possible to
avoid a sign-extension penalty incurred when manipulating
shorter quantities. To reduce energy consumption
and improve performance, the OptAlg tool represents
polynomials in a manner most efficient for a given
architecture [34]. As an example, cosine may be expressed
using two MAC instructions and a MUL to apply
a “Horner transform” on a Taylor Series rather than
making three calls to a cosine library function.
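
One way to read the cosine example is as a fourth-order Taylor polynomial evaluated in Horner form, which needs one multiply and two multiply-accumulates. The sketch below is an illustration under that assumption. It is not output of the OptAlg tool, the function name is hypothetical, and the approximation is only reasonable for small |x|.

```c
/* Fourth-order Taylor approximation of cos(x) about 0 in Horner form:
 *   cos(x) ~= 1 + y * (-1/2 + y * (1/24)),  where y = x * x
 * The three statements map onto one multiply and two
 * multiply-accumulate operations. */
double cos_horner(double x)
{
    double y = x * x;                    /* MUL: y = x^2          */
    double t = (1.0 / 24.0) * y - 0.5;   /* MAC: y/24 - 1/2       */
    return t * y + 1.0;                  /* MAC: (...) * y + 1    */
}
```

Whether a compiler actually emits MAC instructions for these statements depends on the target and its floating-point or fixed-point support, so the operation counts in the comments describe the arithmetic rather than guaranteed machine code.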

Besides architectural constraints, high level languages
such as C may introduce false dependencies which can
be removed by disciplined programmers. For instance,
the use of a global variable implies loads and stores
which can often be eliminated through the use of
register-allocated local variables. Both types of optimizations are
used as guidelines by PHiPAC [6], an automated generator
of optimized libraries. In addition to these general
coding rules, architectural parameters are provided to a
code generator by search scripts which work to find the
best-performing routine for a given platform.
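
The global-variable point can be illustrated with two versions of the same loop. This is a generic sketch, not code from PHiPAC, and the function and variable names are made up for the example. Whether the first version really reloads and stores `total` on every iteration depends on the compiler and optimization level.

```c
#include <stddef.h>

long total;  /* global accumulator (illustrative) */

/* The compiler often cannot prove that `data` does not alias `total`,
 * so each iteration may load and store the global. */
void sum_global(const long *data, size_t n)
{
    total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
}

/* A local accumulator can stay in a register for the whole loop,
 * leaving a single store to the global at the end. */
void sum_local(const long *data, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += data[i];
    total = acc;
}
```

Both functions leave the same value in `total`; the second simply makes the absence of a dependency on memory explicit to the compiler.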

Yang et al. measured the power and energy impact of
various compiler optimizations, and reached the conclusion
that energy can be saved if the compiler can reduce
execution time and memory references [48]. Simunic
found that floating point emulation requires much energy
due to the sheer number of extra instructions required
[46]. It was also discovered that instruction flow optimizations
(such as loop merging, unrolling, and software
pipelining) and ISA specific optimizations (e.g., the use
of a multiply-accumulate instruction) were not applied
by the ARM compiler and had to be introduced manually.
Writing such energy-efficient source code saves more energy
than traditional compiler speed optimizations [45].

The CMU Odyssey project studied “application-aware
adaptation” to deal with the varying, often limited
resources available to mobile clients. Odyssey trades
data quality for resource consumption as directed by the
operating system. By placing the operating system in
charge, Odyssey balances the needs of all running applications
and makes the choice best suited for the system.
Application-specific adaptation continues to improve.
When working with a variation of the Discrete
Cosine Transform and computing first with DC and
low-frequency components, an image may be rendered at
90% quality using just 25% of its energy budget [41].
Similar results are shown for FIR filters and beamforming
using a most-significant-first transform. Parameters
used by JPEG lossy image compression can be varied to
reduce bandwidth requirements and energy consumption
for particular image quality requirements [43]. Research
to date has focused on situations where energy-fidelity
tradeoffs are available. Lossless compression does not
present this luxury because the original bits must be communicated
in their entirety and re-assembled in order at
the receiver.

6  Conclusion  and  Future  Work 

The value of this research is not merely to show that
one can optimize a given algorithm to achieve a certain
reduction in energy, but to show that the choice of
how and whether to compress is not obvious. It is dependent
on hardware factors such as relative energy of
CPU, memory, and network, as well as software factors
including compression ratio and memory access patterns.
These factors can change, so techniques for lossless compression
prior to transmission/reception of data must be
re-evaluated with each new generation of hardware and
software. On our StrongARM computing platform, measuring
these factors allows an energy savings of up to
57% compared with a popular default compressor and
decompressor. Compression and decompression often
have different energy requirements. When one’s usage
supports the use of asymmetric compression and decompression,
up to 12% of energy can be saved compared
with a system using a single optimized application for
both compression and decompression.


When looking at an entire system of wireless devices,
it may be reasonable to allow some to individually use
more energy in order to minimize the total energy used
by the collection. Designing a low-overhead method for
devices to cooperate in this manner would be a worthwhile
endeavor. To facilitate such dynamic energy adjustment,
we are working on EProf: a portable, realtime,
energy profiler which plugs into the PC-Card socket of
a portable device [22]. EProf could be used to create
feedback-driven compression software which dynamically
tunes its parameters or choice of algorithms based
on the measured energy of a system.

ABSTRACT

A  new  dynamic  cache  resizing  scheme  for  low-power  CAM- 
tag  caches  is  introduced.  A  control  algorithm  that  is  only 
activated  on  cache  misses  uses  a  duplicate  set  of  tags,  the 
miss  tags,  to  minimize  active  cache  size  while  sustaining 
close  to  the  same  hit  rate  as  a  full  size  cache.  The  cache 
partitioning  mechanism  saves  both  switching  and  leakage 
energy  in  unused  partitions  with  little  impact  on  cycle  time. 
Simulation  results  show  that  the  scheme  saves  28-56%  of 
data  cache  energy  and  34-49%  of  instruction  cache  energy 
with  minimal  performance  impact. 

Categories  and  Subject  Descriptors 

B.3.2  [Memory  Structures]:  Design  Styles — Associative 
Memory,  Cache  Memory,  Primary  Memory 

General  Terms 

Design 

Keywords 

Content-Addressable-Memory,  Low-Power,  Cache  Resizing, 
Energy  Efficiency,  Leakage  Current 

1.  INTRODUCTION 

Energy dissipation has emerged as one of the primary constraints
for microprocessor designers. In most microprocessor
designs, caches dissipate a significant fraction of total
power. For example, the Alpha 21264 dissipates 16% [12]
and the StrongARM dissipates more than 43% [19] of overall
power in caches. As a result, there has been great interest
in reducing cache power consumption.

Initial cache energy reduction techniques focused on dynamic
switching power [1, 2, 3, 4, 7, 10, 13, 22]. With
technology scaling, leakage current is increasing exponentially,
and more attention has been paid to leakage power
reduction [9, 11, 15, 16, 18, 20].

One approach for reducing cache power consumption is
cache resizing, where the active size of the cache is reduced
to match the current working set. Previously reported cache
resizing schemes can be categorized by the mechanism used
to activate and deactivate cache entries, and by the control
policy used to select the active partition. Some schemes
deactivate cache entries line by line [9, 11], while others deactivate
the cache by sets, ways, or both [1, 16, 20]. The
control policy used to select the active set can be off-line,
where the working set is statically determined by profiling
the application [1], or on-line, where the working set is dynamically
determined as the application executes [9, 11, 16,
20].

Previous cache resizing techniques are designed for RAM-tag
caches, where cache tags are held in RAM structures.
However, commercial low-power microprocessors use CAM-tag
caches, where the cache tags are held in Content Addressable
Memory [14, 19]. CAM-tag caches are popular in
low-power processors because they provide high associativity,
which avoids expensive cache misses, and results in lower
overall energy [23].

This paper introduces miss tag resizing (MTR), a new
cache resizing scheme for CAM-tag caches. MTR uses hierarchical
bitlines to divide each cache subbank into small
way partitions, such that switching and leakage power is
only dissipated in active ways. In addition, individual cache
lines within an active partition can be disabled to further
reduce leakage power. Because CAM-tag caches have high
associativity (32-way for the design simulated), partitioning
the cache by way gives much finer grain control over
cache size compared to RAM-tag way activation [1]. It also
avoids the data remapping problem inherent in set resizing
schemes [16]. In addition, the scheme proposed here adapts
associativity independently in each sub-bank, thereby allowing
total cache size to be varied a single line at a time.
Resizing of different subbanks is spaced evenly in time so
that at most a single dirty line needs to be written back for
a resize event.

The size of an MTR cache is governed using an on-line
control policy which aims to reduce the cache size to the
smallest value that will give a minimal miss rate increase
compared to the full sized cache. The control policy uses an
extra set of tags, the miss tags, which are only accessed on
misses to determine if a full-sized cache would have hit. Because
the miss tags are only accessed on misses, they add no
additional switching energy to hits and can be implemented
using slower, denser, and less leaky transistors, e.g., high Vt
or long channel transistors. The main penalty for using miss
tags is the additional area overhead, which we estimate at
around 10% depending on actual layout styles.

Figure 1: CAM-tag cache organization.

cache_access(action, addr_tag, addr_offset, data) {
  if (addr_tag in tag_array) {  /* hit case */
    if (action == Read) {
      return data_array[addr_tag, addr_offset];
    } else {
      data_array[addr_tag, addr_offset] = data;
    }
    return hit;
  } else {  /* miss case */
    /* fetch data from L2 and update the cache */
    fetch_from_memory(addr_tag, addr_offset);

    /* check whether tag is in MTR tag array */
    if (addr_tag in MTR_tag_array) {
      /* if tag is found in MTR, */
      /* increment MTR hit counter */
      MTR_hits++;
    } else {
      /* otherwise, write the tag into MTR array */
      update_MTR_tag_content(addr_tag);
    }
    return miss;
  }
}

The rest of the paper is organized as follows. Section 2
reviews related work on cache resizing. Section 3 presents
the MTR algorithm. Section 4 describes the hardware modifications
for energy reduction. Section 5 gives results for
active cache size reductions. Section 6 presents the energy
savings achieved by MTR. Section 7 concludes.

2.  RELATED  WORK 

In this section, we discuss existing cache resizing techniques
and cache line deactivation techniques. An off-line
resizing technique was proposed in [1]. Applications are
profiled prior to execution to determine an optimal
set-associativity. At run-time, cache ways of the L1 RAM-tag
set-associative cache are turned off according to the profile
information. This technique reduces both switching and
leakage energy by powering down the entire cache way. However,
it does not adapt to varying cache usage during different
phases of the program execution. As we will show later,
many benchmarks have working sets that vary widely during
various phases of execution. Furthermore, these static techniques
do not work well for multi-programmed machines,
where working set size also varies as a function of the active
process. The DRI I-cache [16] is an on-line resizing technique
that resizes a RAM-tag instruction cache by measuring
the miss rate and keeping it under a preset threshold.
This performance threshold is set to a typical cache miss
rate prior to execution, which does not adapt to program
execution phases. Line deactivation techniques are similar
to the above resizing techniques. These techniques usually
turn off individual cache lines that are not necessarily contiguous.
In cache decay [11], a per-line counter tracks the
usage of each cache line. Lines with no recent uses are turned
off. This technique eliminates the static energy of dead lines
but does not reduce switching energy. Adaptive mode control
(AMC) [9] resizes a RAM-tag cache using a technique
similar to cache decay. AMC keeps all tags turned on. An
ideal miss rate is obtained by searching the entire tag array,
and an actual miss rate is obtained by only searching
the tags of all the active lines. When these two miss rates
differ by more than a preset performance factor, the resize
interval is adjusted. This technique eliminates the need for
presetting the desired miss rates, but only reduces leakage
power in the data arrays. Tag array lookup, however, is a
significant portion of the cache access energy, especially for
CAM-tag caches. In [20], various design choices are compared
to evaluate the usefulness of resizable caches. On average,
over 50% cache size reduction is achieved with either
selective ways [1] or selective sets [16]. Turning off portions
of the cache generally discards the stored data, thus
increasing miss rate and the number of L2 accesses. In [8],
the effect of L2 energy overhead is examined. Our MTR
scheme is similar to AMC in that we resize based on the
difference between the full cache hit rate and the reduced
cache hit rate. However, we employ a separate set of tags
that are only accessed on misses to gather the full cache hit
rate. This avoids additional switching and leakage power in
the regular CAM tags. Also, we use the miss rate difference
to control a fine-grain partitionable cache which can
save switching as well as leakage power. Another problem
with previous partitioning schemes is that when applied to
a data cache, they can generate a large number of dirty line
writebacks in a short time interval when a set or way is
deactivated, or when a decay interval elapses. These write
back bursts add to cache control complexity and can cause
additional performance degradation. MTR performs way
deactivation within a highly associative cache one line at a
time, thus avoiding write back bursts.

cache_resize() {
  if (MTR_hits > HI_BOUND) {
    upsize();
  } else if (MTR_hits < LO_BOUND) {
    downsize();
  } else {
    do_nothing();
  }
  /* reset the MTR hit counter for */
  /* next resizing interval */
  MTR_hits = 0;
}

Figure 2: Pseudo-code for MTR.

3. MISS-TAG RESIZING TECHNIQUE

Figure  1  shows  a  typical  CAM-tag  cache  organization. 
The  entire  cache  is  divided  into  subbanks,  each  consisting 
of  a  tag  array  and  a  data  array,  where  a  subbank  is  a  cache 
set.  Within  each  set  are  the  cache  ways.  The  tags  are  stored 
in  CAM  structures  to  give  high  associativity  at  low  power. 
During  each  cache  access,  one  subbank  (set)  of  the  cache  is 
accessed  and  the  tag  is  broadcast  to  the  entire  tag  array. 
A  matched  tag  results  in  a  hit  and  triggers  the  appropriate 
wordline  to  enable  the  access. 

To implement MTR, we add an extra set of tags, the miss-tags,
which act as the tags of a fixed-size cache. These tags
keep track of what the cache contents would have been if
the cache were always full size. During a regular cache miss,
we consult the miss-tag arrays to see whether a full-size
cache could have avoided the miss. A per-subbank counter
records the number of miss-tag hits, which is precisely
the difference between the number of misses in the
down-sized cache and in a full-size cache. A large difference
in the miss rates suggests that a larger cache
will reduce the miss rate; a small difference indicates that
a smaller cache may be adequate. Two scenarios
could explain a small difference in miss rate between the
full-size and reduced-size caches. In the first, there are no
misses in the regular tags, indicating that the program has a
small working set. In the second, there are many misses
in the regular tags, most of which also miss in the miss-tags.
This suggests that the program has little temporal locality,
as in a data streaming application.
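
The miss-tag bookkeeping described above can be illustrated with a small behavioral model in Python. This is a sketch under stated assumptions rather than the hardware design: the names `Subbank` and `run`, the use of FIFO replacement in both tag arrays, and the policy of updating the miss-tags only on regular-tag misses are all illustrative choices.

```python
from collections import deque


class Subbank:
    """Behavioral model of one cache subbank with regular tags and miss-tags.

    Assumptions (not from the paper): FIFO replacement in both tag arrays,
    and miss-tags that are consulted and updated only on regular-tag misses.
    """

    def __init__(self, full_ways, active_ways):
        self.full_ways = full_ways      # associativity of the full-size cache
        self.active_ways = active_ways  # lines currently powered on
        self.regular = deque()          # tags held by the down-sized cache
        self.miss_tags = deque()        # tags a full-size cache would hold
        self.misses = 0                 # misses in the down-sized cache
        self.miss_tag_hits = 0          # misses a full-size cache would avoid

    def access(self, tag):
        """Simulate one reference; return True on a regular-tag hit."""
        if tag in self.regular:
            return True
        # Regular miss: only now are the miss-tags consulted.
        self.misses += 1
        if tag in self.miss_tags:
            self.miss_tag_hits += 1
        else:
            if len(self.miss_tags) == self.full_ways:
                self.miss_tags.popleft()
            self.miss_tags.append(tag)
        # Fill the down-sized cache, evicting the oldest line if needed.
        if len(self.regular) == self.active_ways:
            self.regular.popleft()
        self.regular.append(tag)
        return False


def run(trace, full_ways=4, active_ways=2):
    """Replay a trace of tags and return (misses, miss_tag_hits)."""
    bank = Subbank(full_ways, active_ways)
    for tag in trace:
        bank.access(tag)
    return bank.misses, bank.miss_tag_hits
```

Under these assumptions, a looping trace such as `[1, 2, 3, 1, 2, 3]` scores a miss-tag hit on every repeated reference (a large difference, so the subbank should grow), whereas a streaming trace of distinct tags scores none even though every reference misses (so a smaller cache is adequate).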

The resizing decision is based on the difference in miss
rates between the active tags and the miss-tags. The pseudocode
in Figure 2 illustrates the resizing control loop of MTR.
There are three parameters in the MTR scheme: miss lower
bound, miss upper bound, and resize interval. Section 5.2
discusses the choice of resizing parameters in detail.
Each subbank is independently resized once during each resizing
interval. Resizing events are spread out evenly within
each interval so that only one subbank resizes at a time, which
minimizes writeback traffic bursts to the lower levels of the
memory hierarchy.
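
One plausible reading of this control loop is sketched below in Python. The function names, the strictness of the comparisons against the bounds, and the way resize events are staggered across an interval are assumptions made for illustration; they are not taken from Figure 2.

```python
def resize_decision(miss_tag_hits, lower_bound, upper_bound):
    """Decide how one subbank should resize at the end of its interval.

    miss_tag_hits is the per-subbank counter of misses that a full-size
    cache would have avoided. The strict inequalities are an assumption.
    """
    if miss_tag_hits > upper_bound:
        return "upsize"    # a larger cache would remove many misses
    if miss_tag_hits < lower_bound:
        return "downsize"  # extra capacity is not paying for itself
    return "hold"


def resize_schedule(num_subbanks, interval):
    """Return the reference count within an interval at which each subbank
    resizes, spaced evenly so that only one subbank resizes at a time."""
    step = interval // num_subbanks
    return [i * step for i in range(num_subbanks)]
```

For example, with 32 subbanks and an interval of 128K references, this schedule resizes one subbank every 4096 references.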

4.  HARDWARE  MODIFICATION 

Figure 3 details three circuit techniques used by MTR.
For the SRAM cells in both data and tag arrays, we use
the Gated-Vdd technique [15] to reduce leakage energy by
adding an N-type stack transistor. Turning off the Line_On
signal virtually eliminates leakage current in the
SRAM cells. We also use the leakage-biased bitline (LBB)
technique proposed in [17] to reduce the leakage in SRAM
bitlines, CAM bitlines and search lines, and CAM match
lines. The leakage power of the circuit depends on the actual
voltage of these heavily capacitive lines. The LBB technique
turns off the precharge of these lines, allowing leakage
currents to self-bias them to the voltage levels at which
leakage power is minimized. The cache
subbanks are divided into eight equal partitions using
hierarchical bitlines [7]. The Partition_On bits
control the activation of each partition. An inactive
partition consumes no switching energy and minimal leakage
energy.

Since the miss-tags are only used during a cache miss,
we can use slow, low-leakage components without incurring
delay overhead. The energy overhead of miss-tag accesses
is added to L2 access energy and is discussed in Section 6.
The area overhead can be reduced by using a denser layout
for the tags, for example, adopting a hybrid RAM-CAM
structure to reduce the number of match comparators.

Figure 3: Energy reduction techniques used by
MTR: Gated-Vdd for SRAM cell leakage reduction;
Leakage-Bias for CAM match line; hierarchical bitlines
for subbank partitioning.

5.  CACHE  SIZE  REDUCTION  RESULTS

To evaluate MTR, we modified the SimpleScalar [5]
simulator. We modeled an in-order, single-issue core in our
experiments. The benchmark set we used is a subset of
SpecINT2000 and SpecFP2000, each running for 1.5 billion
cycles with the reference inputs. We chose a typical low-power
cache configuration [14] as a baseline. It is a 32KB
cache implemented in 32 1KB subbanks. Each subbank consists
of 32 cache lines of 32 bytes. The cache is 32-way
set-associative with a FIFO replacement policy in each subbank.

One unary-encoded resizing pointer per subbank is used to
control which cache lines to activate or deactivate, similar to
the XScale FIFO pointer [14]. When a cache is downsized,
only the last active line is turned off. When it is upsized,
however, the entire partition where the last active line
resides is turned on. If all the lines in that partition are
already active, the next partition is turned on. When all
the lines in a partition are inactive, the partition is turned
off. To avoid thrashing with small cache sizes, we set the
minimum cache size to be one partition.
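
The asymmetric resizing rule can be sketched in Python as follows. The constants follow the configuration above (32 lines per subbank, divided into eight partitions of four lines each); modeling the unary pointer as a simple count of active lines, and the function names themselves, are illustrative assumptions.

```python
LINES_PER_SUBBANK = 32
PARTITIONS = 8
LINES_PER_PARTITION = LINES_PER_SUBBANK // PARTITIONS  # 4 lines


def downsize(active_lines):
    """Turn off only the last active line, never going below one partition."""
    return max(active_lines - 1, LINES_PER_PARTITION)


def upsize(active_lines):
    """Turn on the whole partition holding the last active line, or the
    next partition if that one is already fully active."""
    if active_lines % LINES_PER_PARTITION == 0:
        target = active_lines + LINES_PER_PARTITION
    else:
        # Round up to the end of the partially active partition.
        target = -(-active_lines // LINES_PER_PARTITION) * LINES_PER_PARTITION
    return min(target, LINES_PER_SUBBANK)


def active_partitions(active_lines):
    """A partition stays powered while at least one of its lines is active."""
    return -(-active_lines // LINES_PER_PARTITION)
```

In this model a fully active subbank must be downsized four times (32 to 28 lines) before its last partition can be switched off, while a single upsize restores the whole partition at once.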

5.1  Baseline  Case 

We  implemented  a  baseline  resizing  technique  to  compare 
against  the  miss  tags  scheme.  This  baseline  technique  works 
exactly  like  MTR  except  it  compares  the  actual  cache  miss 
rate  with  the  miss  bounds  to  make  resizing  decisions,  similar 
to  DRI  I-cache  [16].  We  will  refer  to  this  baseline  technique 
as Miss-Rate-Based-Resizing (MRBR).

5.2  Impact  of  Resizing  Parameters 

From simulation results, we found that no individual parameter
has a large impact on resizing performance. The
most important parameter, rather, is the ratio of the miss
upper/lower bounds to the resize interval. For example, setting
miss bounds of 5 to 10 misses for a 32K-reference resizing
interval yields results similar to those for bounds of 10 to 20
misses with a 64K-reference resizing interval. Simulations show
that for larger resize intervals, the number of writebacks
decreases. However, when the resize interval is too large, MTR
starts to yield sub-optimal results. We have found that a resize
interval of 128K memory references worked well for the
benchmarks studied.

Figure 4: CPI versus effective cache size for L1
data cache. MTR gives the smallest effective
cache size for a given CPI.

Figure 5: Miss Rate versus effective cache size for
L1 data cache. MTR gives the smallest effective
cache size for a given miss rate.

5.3  Data  Cache  Resizing  Results 

Figure 4 shows the resizing results for the L1 data cache.
Each data point (effective cache size and CPI pair) is obtained
by varying the miss bounds and resizing interval
length to obtain the optimal CPI for a given effective cache
size. The effective (average) cache size is calculated by
averaging the percentage of active partitions in each resizing
period. In order to verify that both resizing techniques work
better than a fixed-size cache, we simulated the CPI of
fixed-size caches of sizes 32KB, 16KB, and 8KB. The figure shows
that for the same CPI, MTR yields much smaller effective cache
sizes. We limited ourselves to configurations that yield
less than a 2% CPI increase to ensure that MTR does not incur a
large performance penalty. Parameters were varied to show
the trade-off between effective cache size and performance.
For the same effective cache size, MTR performs much better
than the baseline technique. Figure 5 further supports
this result. MTR introduces less than a 16% increase
over the miss rate of the largest fixed-size cache. Again, for
the same effective cache size, MTR has the lowest miss rate. On
average, MTR uses less than an 8KB effective cache size while
increasing the CPI by less than 1.5%.
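
As a worked illustration of the effective cache size metric, the following Python sketch averages the fraction of active partitions over a sequence of resizing periods. The function name and the sample inputs are hypothetical; the default of 256 partitions corresponds to 32 subbanks of eight partitions each in the 32KB baseline.

```python
def effective_cache_size_kb(active_partitions_per_period,
                            total_partitions=256, full_size_kb=32.0):
    """Average the fraction of active partitions across resizing periods
    and scale it by the full cache size."""
    if not active_partitions_per_period:
        raise ValueError("at least one resizing period is required")
    fractions = [count / total_partitions
                 for count in active_partitions_per_period]
    return full_size_kb * sum(fractions) / len(fractions)
```

For instance, four periods with 256, 128, 64, and 64 active partitions average to half of the partitions, giving an effective cache size of 16KB.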

Figures 6 and 7 show how the effective cache size and the
actual miss rates change over time with MTR. The figures on
the left-hand side (Figure 6) show the effective cache size over
time. We observe two different behaviors. Benchmarks 164.gzip,
177.mesa, 183.equake, 197.parser, and 256.bzip2 demonstrate
MTR’s ability to adapt to different phases of the execution
with varying cache usage. For the rest of the benchmarks,
cache usage is constant throughout the execution.
MTR is able to find the optimal size for each benchmark
without prior profiling information. The figures on the
right-hand side (Figure 7) show how the miss rates change
throughout the execution. We observe that an increase in the
miss rate is countered by an increase in cache size, which in
turn reduces the miss rate.


5.4  Instruction  Cache  Resizing  Results 

For our benchmark set, the instruction cache has extremely
low miss rates. Therefore, it is easier to find a common
reference miss rate for a large set of benchmarks. For all
the benchmarks we used in this paper, the baseline resizing
technique and MTR have similar performance. Both
outperform the fixed-size instruction cache. Figures 8
and 9 show that MTR uses an effective cache size of less than
12KB while introducing, on average, less than a 12% increase
in miss rate and a 1.4% increase in CPI.

6.  ENERGY  REDUCTION  RESULTS 

In this section, we present the energy savings obtained by
MTR. The energy consumption figures are obtained through
HSpice simulation of extracted layout from Cadence [6] using
TSMC 0.25 µm technology [21]. The cache design has
been significantly optimized for low power, including divided
word lines and low-swing bitlines. Table 1 shows the different
energy components of this CAM-tag cache. MTR reduces
the data array and CAM-tag array access energy but
not the decoding energy. Since the actual percentage of cache
leakage power in the total cache power can vary significantly
with process technology, operating temperature, and voltage,
among other factors, we quantify cache leakage as a
percentage of total cache power and demonstrate the savings
across a range of possible values. We perform a similar
sensitivity analysis for L2 cache energy by quantifying L2
access energy as a multiple of L1 access energy and giving
results for a range of values. We include the search energy for
the miss-tags as part of L2 energy. The energy reduction is
calculated as

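\[
\begin{aligned}
&\text{L1 switching energy reduction} \times \text{\% of switching energy} \\
{}+{} &\text{L1 leakage energy reduction} \times \text{\% of leakage energy} \\
{}-{} &\text{miss rate increase} \times \text{L2 access energy}
\end{aligned}
\]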
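
The expression above can be evaluated with a few lines of Python. This is a sketch under assumptions that the text does not spell out: every quantity is expressed as a fraction of the baseline L1 energy per access, the miss rate increase is the absolute number of additional misses per access, and the L2 access energy is given as a multiple of the L1 access energy. The function and parameter names are hypothetical, and the sample inputs below are illustrative values, not measured results.

```python
def energy_reduction(switching_reduction, leakage_reduction,
                     leakage_fraction, miss_rate_increase,
                     l2_energy_multiple):
    """Net fractional energy saving for a resized L1 cache.

    switching_reduction, leakage_reduction: fractional savings in each
        component (0.0 to 1.0).
    leakage_fraction: share of total cache energy that is leakage.
    miss_rate_increase: additional misses per access (assumed absolute).
    l2_energy_multiple: L2 access energy as a multiple of L1 access energy.
    """
    switching_fraction = 1.0 - leakage_fraction
    return (switching_reduction * switching_fraction
            + leakage_reduction * leakage_fraction
            - miss_rate_increase * l2_energy_multiple)
```

With illustrative inputs of a 50% switching saving, a 75% leakage saving, an even split between switching and leakage energy, one extra miss per thousand accesses, and an L2 access costing 16 times an L1 access, the terms are 0.25 + 0.375 - 0.016, a net saving of about 61%. Raising the L2 multiple lowers the saving, which matches the ordering of the curves described for Figures 10 and 11.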

Figures 10 and 11 show the energy reduction of the data and
instruction caches. The x-axis represents the percentage of
leakage energy in the total energy consumption. The y-axis
represents the energy savings. Based on the previous experiments,
we use resizing parameters such that the effective data cache
size is 8KB and the effective instruction cache size is 12KB.
These parameters are chosen to minimize the performance impact
while turning off the maximum number of partitions in the
cache.

Figure 6: Different effective cache sizes during
different phases of a 32KB data cache determined
by MTR. The x-axis represents 0 to 1.5 billion
cycles.

Figure 7: Cache miss rates during different
phases of a 32KB data cache determined by MTR.
The x-axis represents 0 to 1.5 billion cycles.

Figure 8: CPI versus effective cache size for L1
instruction cache. MTR and MRBR have similar
performance.

Figure 9: Miss rate versus effective cache size
for L1 instruction cache. MTR and MRBR have
similar performance.

Figure 10: Data cache energy savings. The x-axis
represents the percentage of leakage energy in total
energy. The y-axis represents savings. Each curve
represents a different L2 access energy quantified
as a factor of L1 write access energy.

Figure 11: Instruction cache energy savings. The
x-axis represents the percentage of leakage energy
in total energy. The y-axis represents savings. Each
curve represents a different L2 access energy
quantified as a factor of L1 write access energy.

Table 1: Energy components of CAM-tag cache in
TSMC 0.25 µm technology. A ✓ means the read or
write access performs that operation and thus uses that
energy component.

Operation            Energy (pJ)   Read       Write
CAM-Array Search     57.1          ✓          ✓
Data-Array Read      26.2          ✓
Data-Array Write     53.5                     ✓
Decoding & I/O       12.2          ✓          ✓
Total                              95.5 pJ    122.8 pJ

Each curve represents the energy savings for a
specific L2 access energy. We chose a range of L2 access
energies, from 16× to 128× the L1 write access energy. For the
data cache, the energy reduction from MTR ranges from 28%, when
there is no leakage energy and the L2 penalty is 128× the L1
write access energy, to 56%, when 50% of the cache energy is
leakage and the L2 penalty is 16× the L1 access energy.
Similarly, the MTR reduction ranges from 34% to 49% for the
instruction cache, depending on leakage percentage and L2
penalty.

7.  CONCLUSION 

In this paper, we presented MTR, a dynamic cache resizing
technique for CAM-tag caches. The dynamic control
mechanism of MTR uses a set of duplicate miss-tags to keep
track of the miss rate as if the entire cache were used.
Resizing decisions are made according to the difference between
the actual miss rate and the miss rate of the miss-tags. The
control mechanism is only activated on misses, thereby saving
energy and allowing the duplicate tags to be implemented
in slower and denser logic using low-leakage transistors. The
cache partitioning mechanism saves both switching and leakage
energy in unused partitions, and allows resizing at
single-line granularity. The subbanks are resized independently
in non-overlapping phases to avoid writeback bursts. With
around 10% area overhead, MTR reduces data cache energy by
28-56% and instruction cache energy by 34-49%, where
the baseline caches were highly optimized for low-power but
fixed-size operation.